The Vision of Sky Computing

We first describe what Sky Computing is, and articulate why we see it as not just tactical but transformative.

What Is Sky Computing

Given this increasing level of limited interface compatibility, how do we leverage it to ease workload migration? There are two key components. First, in order to reduce data gravity, clouds can enter into reciprocal free data peering; i.e., two clouds can agree to let users move data from one cloud to another without charge. With high-speed connections prevalent (many clouds have 100 Gbps connections to various interconnection points where they can peer with other clouds), we think such free peering can easily be supported, with its costs more than offset by the increase in computational revenue that it enables. One might worry about the delay that such transfers incur, but if the resulting computation times are superlinear in the data size (or linear with a reasonably high constant) then no matter how large datasets become, the networking delays will not be a major bottleneck.

Note

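As mentioned above, the trend toward limited interface compatibility keeps growing, which lays a solid foundation for migrating workloads between clouds.

The first component is that cloud providers can establish reciprocal free data peering to reduce "data gravity".

Recall that "data gravity" means workload data is easy to move into a cloud, but moving it back out is very expensive.

The authors argue that such free peering can easily be supported, with its cost more than offset by the increase in computational revenue it enables.

If the resulting computation time is superlinear in the data size (or linear with a reasonably high constant), then network delay will not be the main bottleneck no matter how large the dataset becomes: transfer time grows only linearly with data size, so in the superlinear case its share of the total time shrinks as the dataset grows, and in the linear case it stays a small, fixed fraction.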
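The bottleneck argument can be checked with a small calculation. The sketch below is illustrative only: the 100 Gbps link speed comes from the text, but the `size ** 1.5` compute model and the constant `k` are assumptions chosen purely to show the trend, not measurements.

```python
GBPS = 100  # link speed quoted in the text


def transfer_seconds(size_bytes: float, gbps: float = GBPS) -> float:
    """Time to move the dataset over one link (linear in size)."""
    return size_bytes * 8 / (gbps * 1e9)


def compute_seconds(size_bytes: float, k: float = 1e-12, exponent: float = 1.5) -> float:
    """Hypothetical superlinear compute cost: k * size ** exponent."""
    return k * size_bytes ** exponent


def transfer_share(size_bytes: float) -> float:
    """Fraction of total job time spent on the transfer."""
    t = transfer_seconds(size_bytes)
    return t / (t + compute_seconds(size_bytes))


# The transfer share falls as the dataset grows from 1 GB to 1 PB.
for size in (1e9, 1e12, 1e15):
    print(f"{size:.0e} bytes: transfer share = {transfer_share(size):.6f}")
```

With a linear compute model (`exponent=1.0`) the share would instead stay constant, which is why the text asks for a reasonably high constant in that case.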

The second component, and the one we focus on for the rest of this paper, is what we call intercloud brokers. In this paper we describe our intercloud broker, which is designed specifically for computational batch jobs (§4). While batch jobs (e.g., ML, scientific jobs, data analytics) represent only a fraction of today’s diverse cloud use cases, their computation demands are growing quickly [74] and are responsible for the recent surge of specialized hardware [15, 22, 23]. Thus, we have started with a broker designed for batch jobs as a tractable but common and rapidly growing workload. We expect future versions of the broker will address a wider range of workloads, and provide a broader set of features, but that is not our focus here. In addition, we expect that eventually there will be an open market in intercloud brokers that charge a small fee for their brokerage service; some of those brokers will be general purpose and others more tailored to specific workloads, as ours is.

An intercloud broker takes as input a computational request that is specified as a directed acyclic graph (DAG) in which the nodes are coarse-grained computations (e.g., data processing, training). For lack of a better term we call these computations “tasks”. The request also includes the user’s preferences about price and performance.
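As a rough illustration of such a request, the sketch below models tasks as DAG nodes and derives a valid execution order. The `Task` and `Request` classes and the `preferences` keys are hypothetical names invented for this example; they are not the broker's actual interface.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Task:
    """One coarse-grained computation (a node in the DAG)."""
    name: str
    depends_on: list[str] = field(default_factory=list)


@dataclass
class Request:
    """A computational request: a DAG of tasks plus user preferences."""
    tasks: list[Task]
    preferences: dict[str, str]  # hypothetical, e.g. {"optimize": "cost"}

    def execution_order(self) -> list[str]:
        # Map each task to its predecessors; a cycle raises graphlib.CycleError.
        graph = {t.name: set(t.depends_on) for t in self.tasks}
        return list(TopologicalSorter(graph).static_order())


request = Request(
    tasks=[
        Task("data_processing"),
        Task("training", depends_on=["data_processing"]),
        Task("serving", depends_on=["training"]),
    ],
    preferences={"optimize": "cost"},
)
print(request.execution_order())
```

Using a topological sort means a malformed request with a dependency cycle is rejected up front rather than at placement time.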

The intercloud broker is then responsible for placing these tasks across clouds. Unlike existing multicloud applications which run an application instance per cloud, an intercloud broker can run a single application instance across several clouds. For example, Figure 1 shows a machine learning (ML) pipeline with three tasks: data processing, training, and serving. The user may wish to minimize the total cost while processing data securely. The intercloud broker might decide to run data processing on Azure Confidential Computing [16] to anonymize data and thus protect data confidentiality, training on GCP to take advantage of TPUs [23], and serving on AWS to take advantage of the Inferentia accelerator [15].
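The placement decision in this example can be sketched as a per-task search over cloud offers. Everything in the block is assumed for illustration: the `Offer` type, the feature labels, and all prices are made up, and the greedy rule ignores inter-cloud transfer costs that a real broker would have to weigh.

```python
from typing import NamedTuple


class Offer(NamedTuple):
    cloud: str
    task: str
    price: float          # hypothetical USD per run
    features: frozenset   # capabilities this offer provides


# Made-up offers, loosely mirroring the ML pipeline example in the text.
OFFERS = [
    Offer("azure", "data_processing", 6.0, frozenset({"confidential"})),
    Offer("gcp", "data_processing", 4.0, frozenset()),
    Offer("gcp", "training", 30.0, frozenset({"tpu"})),
    Offer("aws", "training", 45.0, frozenset()),
    Offer("aws", "serving", 8.0, frozenset({"inferentia"})),
    Offer("azure", "serving", 11.0, frozenset()),
]


def place(required: dict[str, set[str]]) -> dict[str, str]:
    """Pick, per task, the cheapest offer that provides all required features."""
    placement = {}
    for task, needs in required.items():
        candidates = [o for o in OFFERS if o.task == task and needs <= o.features]
        if not candidates:
            raise ValueError(f"no cloud can run {task!r} with {sorted(needs)}")
        placement[task] = min(candidates, key=lambda o: o.price).cloud
    return placement


# Data processing must be confidential; the other tasks only minimize cost.
plan = place({"data_processing": {"confidential"}, "training": set(), "serving": set()})
print(plan)
```

The security requirement overrides price for data processing (the cheaper offer lacks the `confidential` feature), while the unconstrained tasks simply go to the lowest-priced offer.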

[Figure 1: an ML pipeline with three tasks (data processing, training, and serving).]

The ability to partition applications enables the emergence of specialized clouds. For example, a cloud provider can build a successful business by just focusing on a single task, such as ML training, and offering the best price-performance for that task; see §A.1 for a more detailed discussion of this.

Intercloud Brokers

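Key component: intercloud brokers.

Input: a computational request given as a directed acyclic graph (DAG) whose nodes are coarse-grained computations, which the paper calls "tasks".

The intercloud broker places these tasks across clouds; it can run __a single application instance across several clouds__.

A single instance is split into several tasks, and the intercloud broker can assign different tasks to different clouds, much like an automated multi-stage pipeline.

In practice this makes room for specialized clouds: a provider can focus on a single task and offer the best price-performance for it.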

In addition, the intercloud broker provides benefits even when the application (i) runs entirely on a single cloud, by automatically choosing the cloud that best matches the user’s preferences and choosing the best region and zone within that cloud, or (ii) uses services provided only by a single cloud, by placing a task on that cloud but still having the freedom to use other clouds for the other tasks.
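Both cases reduce to the same search over locations, with case (ii) adding a restriction for the pinned task. The sketch below assumes a hypothetical price catalog keyed by `(cloud, region, zone)`; the names and prices are placeholders, not real offerings.

```python
from typing import Optional

# Hypothetical hourly prices: (cloud, region, zone) -> USD per hour.
CATALOG = {
    ("cloud_a", "region_1", "zone_a"): 3.20,
    ("cloud_a", "region_2", "zone_a"): 2.70,
    ("cloud_b", "region_1", "zone_a"): 2.90,
    ("cloud_b", "region_1", "zone_b"): 2.40,
}


def best_location(catalog: dict, allowed_clouds: Optional[set] = None) -> tuple:
    """Return the cheapest (cloud, region, zone), optionally limited to some clouds."""
    candidates = {
        loc: price
        for loc, price in catalog.items()
        if allowed_clouds is None or loc[0] in allowed_clouds
    }
    if not candidates:
        raise ValueError("no location satisfies the cloud restriction")
    return min(candidates, key=candidates.get)


# Case (i): the whole application goes to the single best cloud, region, and zone.
print(best_location(CATALOG))

# Case (ii): one task needs a service only cloud_a offers; other tasks stay unrestricted.
print(best_location(CATALOG, allowed_clouds={"cloud_a"}))
```

Price is the only criterion here; a fuller version would also score performance and hardware availability according to the user's preferences.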

Single Cloud Situation

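Scenario: the application is tied, fully or partly, to a single cloud.

1) If the whole application runs on one cloud, the intercloud broker can automatically choose the cloud that best matches the user's preferences, along with the best region and zone within it.

2) If only one task depends on a service that a single cloud offers, the intercloud broker places that task on that cloud and remains free to run the other tasks on other clouds.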

Why Is This Transformational

There are three reasons, each from a different perspective, why we see this as a transformational change in cloud computing, not as merely a tactical mechanism for workload migration.

Why Sky Computing is not just tactical but also transformative

User’s Perspective: When using an intercloud broker, users are no longer interacting with individual clouds, but with a more integrated “Sky” of computing. They merely specify their computation and their criteria, and the broker then places the job. This makes it significantly easier to use the cloud, and may lead to increased cloud adoption. Note that such an interface hides the heterogeneity between and within clouds. Users no longer need to research which clouds have the best prices, or offer a particular service. This also applies within individual clouds, because different regions within a cloud can offer different hardware options and different prices.

User’s Perspective

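Users no longer interact with individual clouds directly, but with a more integrated "Sky" of computing.
The intercloud broker hides the heterogeneity between and within clouds.
The same idea carries over to a single cloud, whose regions can differ in hardware options and prices.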

Competitive Perspective: Note that by serving as an intermediary between users and clouds, the intercloud broker is creating a fine-grained two-sided market for computation: users specify their tasks and requirements, and clouds offer their interfaces with their pricing and performance. Job placement is no longer driven mostly by measures to promote lock-in (e.g., proprietary interfaces and data gravity), but increasingly by the ability of each cloud to meet the user’s requirements through faster and/or more cost-efficient implementations. This means that the clouds, in order to increase their market, will likely start supporting interfaces that are commonly used in jobs, driving the market towards increased compatibility.

Ecosystem Perspective: Once there is a two-sided market established, the cloud ecosystem can transition from one in which all clouds offer a broad set of services and try their best to lock customers in, to one in which many clouds focus on becoming part of a computational Sky, where they can specialize in certain tasks because the intercloud broker will automatically direct computations to them if they best meet user needs for those particular tasks; the economic analysis in the appendix (§A.1.2) makes this case more precisely.

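Ecosystem Perspective

Many clouds focus on becoming part of a computational Sky.
Clouds can specialize in particular tasks, because the intercloud broker automatically directs computations to whichever cloud best meets the user's needs for those tasks.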

This vision should be tempered with several doses of reality. First, while we envision some clouds will embrace the vision of Sky Computing by focusing on compatible interfaces and adopting reciprocal free data peering, we expect others, particularly those with dominant market positions, to continue with lock-in as a market strategy. Nonetheless, the presence of a viable alternative cloud ecosystem will set the bar for innovation and meeting user requirements, so all users will benefit. Second, we assume that the creation of Sky Computing will be a lengthy process that will start slowly and gradually gather momentum. Our goal in this paper is to investigate how to start this transformation, not to define its ultimate form. As such, we start with an intercloud broker for batch jobs, a small but important set of workloads. Third, given our focus on the early stages of the Sky, we do not provide solutions to several problems that must eventually be addressed, such as how to troubleshoot failures that occur with applications running across multiple clouds.
