跳转至

Related Concepts and Recent Developments

In this section we first review two concepts related to the ability to migrate workloads – standards and multicloud – and then discuss the recent progress towards compatibility.

(1) Standards 标准 (2) Multicloud 多云系统 (3) Compatibility 兼容性

Why Not Just Adopt Standards?

The first question one might ask is if seamless migration is the goal, why not adopt a set of uniform and comprehensive cloud standards, as was done for the Internet and cellular? In fact, a decade ago IEEE proposed a set of Intercloud standards for portability, interoperability, and federation among cloud providers [88] involving an Intercloud Service Catalog and an Intercloud federation layer. There are two fundamental problems with this and other proposals for such uniform and comprehensive cloud standards. First, there is no incentive for the dominant clouds (i.e., those with large market shares) to adopt such standards; it would decrease their competitive advantage and make it easier for customers to move their business to other clouds. Second, users interact with clouds at many levels, using high-level service interfaces such as PyTorch [76] or TensorFlow [53] in addition to low-level orchestration interfaces such as Kubernetes [36]. If the goal is to make workload migration seamless, then all of these interfaces would need to be standardized. Requiring every cloud to standardize every interface is both unrealistic (as noted in the first objection) and unwise (because these higher-level interfaces have changed significantly over time, and standardizing them would greatly hinder innovation).

第一个问题可能是,如果无缝迁移是目标,为什么不采用一套统一且全面的云标准,就像互联网和蜂窝通信所做的那样?事实上,十年前,IEEE曾提出一套云之间的互操作性、可移植性和联合的Intercloud标准[88],涉及一个Intercloud服务目录和一个Intercloud联合层。然而,这些统一全面的云标准提议存在两个根本性问题。首先,主导性的云提供商(即拥有大量市场份额的云服务)没有动力采用这些标准,因为这会削弱它们的竞争优势,并让客户更容易将业务迁移到其他云服务商。其次,用户在多个层面与云进行交互,除了使用像Kubernetes[36]这样的低级编排接口之外,还使用诸如PyTorch[76]或TensorFlow[53]等高级服务接口。如果目标是实现工作负载的无缝迁移,那么所有这些接口都需要标准化。要求每个云服务商对所有接口进行标准化既不现实(正如第一个反对意见所指出的),也不明智(因为这些高级接口随着时间的推移发生了显著变化,标准化它们将极大地阻碍创新)。

No Unified Standard

我们不采用统一标准,因为:

1)大型云服务商不希望破坏自身垄断地位

2)云平台交互分为多个层次,每个层级的接口全部统一,不现实也不明智

Why Isn’t This Just Multicloud

Multicloud is now an industry buzzword, and there are reports [33,52] that most enterprises have, or will soon have, multicloud deployments; this would seemingly imply that our goal of seamless workload migration has already been realized. However, the common use of the term multicloud only requires that an enterprise have workloads on two or more clouds (e.g., the finance team runs their backend functions on Amazon while the analytics team runs their ML jobs on Google), not that they can easily move those workloads between clouds. It is clear, from everyone we have talked to in the industry, that moving many workloads between clouds remains difficult. The exceptions to this are the recent third-party offerings (e.g., by Trifacta, Confluent, Snowflake, Databricks, and others) that run on multiple clouds; users can indeed migrate their workloads that only use these services between clouds relatively easily (BigQuery, offered by Google, offers similar cross-cloud support). However, these are for specific workloads, and do not provide general support for workload migration.

“多云”现在已成为行业热词,有报告指出[33,52],大多数企业已经或即将部署多云架构;这似乎意味着我们实现了无缝工作负载迁移的目标。然而,“多云”这一术语的普遍使用仅意味着企业在两个或更多云上运行工作负载(例如,财务团队在亚马逊上运行其后端功能,而分析团队在谷歌上运行机器学习任务),而不是说它们能够轻松地在这些云之间迁移工作负载。我们与行业中的许多人交流后发现,跨云迁移大量工作负载依然十分困难。例外情况是最近的一些第三方服务(例如Trifacta、Confluent、Snowflake、Databricks等),它们可以在多个云上运行;用户确实可以相对容易地在云之间迁移仅使用这些服务的工作负载(谷歌提供的BigQuery也提供类似的跨云支持)。然而,这些仅适用于特定工作负载,并不提供一般性的工作负载迁移支持。

In addition, there are several programming or management frameworks that support multiple clouds. JClouds [8] and Libcloud [10] offer portable abstractions over the compute, storage, and other services of many providers. However, the user still does the placement manually, whereas automatic placement is a key feature of Sky Computing. On the management front, Terraform [51] provisions and manages resources on different clouds, but requires the usage of provider-specific APIs, and also does not handle job placement. Kubernetes [36] orchestrates containerized workloads and can be run across multiple clouds (e.g., Anthos [5]). These frameworks, while quite valuable, focus on providing more compatibility in the lower-level infrastructure interfaces offered by the clouds (see §2.3), and as such are nicely complementary with Sky Computing but do not obviate the need for Sky Computing.

此外,还有多个支持多云的编程或管理框架。JClouds [8] 和 Libcloud [10] 提供了多个云服务提供商在计算、存储等服务上的可移植抽象。然而,用户仍需手动进行任务部署,而自动部署是天空计算的关键特性。在管理方面,Terraform [51] 能在不同云平台上配置和管理资源,但需要使用特定于提供商的API,且不处理任务部署。Kubernetes [36] 能编排容器化工作负载,并可以跨多个云运行(如Anthos [5])。这些框架虽然非常有价值,主要专注于提供云服务的低级基础设施接口的兼容性(见第2.3节),因此它们与天空计算形成了良好的互补关系,但并不能替代天空计算的必要性。

Multi-Cloud Terminology

“多云”这一术语的普遍使用仅意味着企业在两个或更多云上运行工作负载(例如,财务团队在亚马逊上运行其后端功能,而分析团队在谷歌上运行机器学习任务),而不是说它们能够轻松地在这些云之间迁移工作负载。

部分云服务提供商在计算、存储等服务上的可移植抽象。然而,用户仍需手动进行任务部署,而自动部署是天空计算的关键特性。

Growth In Interface Compatibility

Turning away from related concepts, we now discuss a recent development that Sky Computing will leverage. As noted before, users of cloud computing invoke a wide variety of computational and management interfaces. Many of these are open source systems that have become the de facto standards at different layers of the software stack, including operating systems (Linux), cluster resource managers (Kubernetes [36], Apache Mesos [63]), application packaging (Docker [27]), databases (MySQL [41], Postgres [43]), big data execution engines (Apache Spark [93], Apache Hadoop [89]), streaming engines (Apache Flink [57], Apache Spark [93], Apache Kafka [9]), distributed query engines and databases (Cassandra [7], MongoDB [39], Presto [44], SparkSQL [48], Redis [45]), machine learning libraries (PyTorch [76], TensorFlow [53], MXNet [58], MLflow [38], Horovod [79], Ray RLlib [66]), and general distributed frameworks (Ray [71], Erlang [55], Akka [1]). In addition, some of AWS’s interfaces are increasingly being supported on other clouds: Azure and Google provide S3-like APIs for their blob stores to make it easier for customers to move from AWS to their own clouds. Similarly, APIs for managing machine images and private networks are converging.

转向相关概念之外,我们现在讨论天空计算将利用的一项最新发展。如前所述,云计算的用户调用各种各样的计算和管理接口。许多接口是已成为不同软件栈层级上事实标准的开源系统,包括操作系统(Linux)、集群资源管理器(Kubernetes [36]、Apache Mesos [63])、应用打包(Docker [27])、数据库(MySQL [41]、Postgres [43])、大数据执行引擎(Apache Spark [93]、Apache Hadoop [89])、流处理引擎(Apache Flink [57]、Apache Spark [93]、Apache Kafka [9])、分布式查询引擎和数据库(Cassandra [7]、MongoDB [39]、Presto [44]、SparkSQL [48]、Redis [45])、机器学习库(PyTorch [76]、TensorFlow [53]、MXNet [58]、MLflow [38]、Horovod [79]、Ray RLlib [66])以及通用分布式框架(Ray [71]、Erlang [55]、Akka [1])。此外,AWS的一些接口正逐渐在其他云上得到支持:Azure和Google提供类似S3的API,用于它们的对象存储服务,以便客户更容易从AWS迁移到它们自己的云平台。同样,管理虚拟机镜像和私有网络的API也在趋同。

翻译成人话

这些不同领域的常见开源框架,在某种程度上成为对应领域Interface的范式

These trends increase what we call limited interface compatibility, where both of these qualifiers are crucial. This compatibility applies only to individual interfaces and these interfaces are typically not supported by all clouds but by more than one. Our contention, based on what we see in the ecosystem, is that the number and the usage of these interfaces that have this limited compatibility – i.e., are supported on more than one cloud – is increasing, largely but not exclusively due to open-source efforts.

这些趋势促成了我们所称的有限接口兼容性,其中“有限”和“接口”这两个修饰词都至关重要。这种兼容性仅适用于某些具体的接口,并且这些接口通常并非所有云平台都支持,但会被多个云平台支持。根据我们对生态系统的观察,我们主张,具有这种有限兼容性的接口——即被多个云平台支持的接口——的数量和使用率正在增加,这在很大程度上(但并非完全)归功于开源努力。

We are basing our approach on the belief that this trend will continue, and that leveraging this trend is far preferable to pursuing uniform and comprehensive standards. To paraphrase a quote attributed to Lincoln, we know that all interfaces are supported by some clouds, and some interfaces may be supported by all clouds, but we cannot and should not require that all interfaces be supported by all clouds.

我们的方法基于一个信念:这一趋势将会持续,而利用这一趋势远优于追求统一且全面的标准。借用林肯的一句名言,我们知道所有接口在某些云平台上都有支持,有些接口可能会在所有云平台上得到支持,但我们不可能也不应该要求所有接口都在所有云平台上得到支持。