SkyPilot: An Intercloud Broker for Sky Computing
Abstract
To comply with the increasing number of government regulations about data placement and processing, and to protect themselves against major cloud outages, many users want the ability to easily migrate their workloads between clouds. In this paper we propose doing so not by imposing uniform and comprehensive standards, but by creating a fine-grained two-sided market via an intercloud broker. These brokers will allow users to view the cloud ecosystem not just as a collection of individual and largely incompatible clouds but as a more integrated Sky of Computing. We describe the design and implementation of an intercloud broker, named SkyPilot, evaluate its benefits, and report on its real-world usage.
Note
- easily migrate their workloads between clouds
- an intercloud broker, named SkyPilot
Introduction
The modern information infrastructure is built around three components. The Internet provides end-to-end network connectivity, cellular telephony provides nearly ubiquitous user access via increasingly powerful handsets, and cloud computing makes scalable computation available to all. These ecosystems obviously have many superficial differences, but perhaps their most fundamental difference lies in the degree of compatibility between providers in each of these ecosystems.
The Internet and the cellular infrastructure were designed with the goal of universal reachability. This required both uniform and comprehensive industry standards and broadly adopted interconnection agreements (for Internet peering and cellular roaming) that led to a globally connected federation of competing providers. The cloud ecosystem has very different origins, emerging as a replacement for dedicated on-premise computing clusters rather than serving as an interconnected communication infrastructure. As a result, cloud providers began by emphasizing their differences rather than their similarities; though the clouds are all based on the same basic conceptual units (e.g., VMs, containers, and now FaaS), they initially differed greatly in their orchestration interfaces. These orchestration interfaces have become more similar over time, but some clouds continue to differentiate themselves through numerous proprietary service interfaces, such as those for storage or key-value stores. In addition, clouds typically charge much more for data leaving than for data entering, resulting in “data gravity” (i.e., the difficulty of moving jobs to another cloud due to the expense of transferring the data). The combination of proprietary service interfaces and data gravity has led to significant customer lock-in: it is hard for companies that have established their computational workloads on one cloud to move them to another.
Common Sense of Cloud Computing
- The cloud ecosystem has very different origins, emerging as a replacement for dedicated on-premise computing clusters rather than serving as an interconnected communication infrastructure.
- Some clouds continue to differentiate themselves through numerous proprietary service interfaces
- It is hard for companies that have established their computational workloads on one cloud to move them to another.
The major problem today: migrating workloads between clouds costs too much!
However, as cloud computing has become a critical part of our computational infrastructure, enterprises are increasingly worried about how difficult it is to migrate workloads between clouds. There are two compelling reasons for wanting more freedom in workload placement. First, no business wants any critical part of its infrastructure tied to a single provider, because such lock-in reduces its negotiating leverage and also makes the business vulnerable to large-scale outages at the provider. Second, there are now strict regulations about data and operational sovereignty that dictate where data can be stored and computational jobs run. Not all cloud providers have datacenters in all countries, so the inability to migrate jobs between cloud providers could be a painful roadblock to satisfying these new regulations. These two reasons are not theoretical problems whose solutions would be “nice-to-have”; the recent occurrence of large-scale cloud outages and the increasing number of government regulations are quickly making such a solution a “must-have” for large-scale users of the cloud. This paper is about how we can ease the migration of workloads through the rise of Sky Computing, a concept first introduced in [81] but significantly extended and more deeply explored here. In Sky Computing, users do not interact directly with the cloud; instead, they submit their jobs to what we call intercloud brokers, which handle the placement and oversee the execution of those jobs.
Review: Sky Computing
- ease the migration of workloads through the rise of Sky Computing
- Sky Computing is: users, rather than directly interacting with the cloud, submit their jobs to what we call intercloud brokers who handle the placement and oversee the execution of their jobs.
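The broker role summarized above can be sketched as a small placement function. This is only an illustration under stated assumptions, not SkyPilot's actual implementation or API: the names `Offer`, `Job`, `total_cost`, and `place`, the cloud and region names, and every price are hypothetical. The sketch filters offers by the regions a job is allowed to run in (the sovereignty constraint) and then picks the cheapest one, counting the egress charge for moving data out of its current cloud (the data gravity effect).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One cloud/region option a broker could place a job on (hypothetical values)."""
    cloud: str
    region: str
    price_per_hour: float  # compute price in USD per hour


@dataclass(frozen=True)
class Job:
    """A batch job as seen by the broker (hypothetical fields)."""
    hours: float                # expected runtime
    data_gb: float              # size of the input data
    data_cloud: str             # cloud where the input data currently lives
    allowed_regions: frozenset  # regions permitted by data-sovereignty rules


def total_cost(job, offer, egress_per_gb):
    """Compute cost plus the egress charge if the data must leave its home cloud."""
    compute = job.hours * offer.price_per_hour
    if offer.cloud == job.data_cloud:
        egress = 0.0  # data stays where it is, so no transfer charge
    else:
        egress = job.data_gb * egress_per_gb[job.data_cloud]
    return compute + egress


def place(job, offers, egress_per_gb):
    """Return the cheapest offer that satisfies the job's region constraint."""
    candidates = [o for o in offers if o.region in job.allowed_regions]
    if not candidates:
        raise ValueError("no offer satisfies the job's region constraints")
    return min(candidates, key=lambda o: total_cost(job, o, egress_per_gb))


# Hypothetical market: cloud_b is cheaper per hour, but the data lives on cloud_a.
offers = [
    Offer("cloud_a", "eu", 3.0),
    Offer("cloud_b", "eu", 2.0),
    Offer("cloud_b", "us", 1.0),
]
egress_per_gb = {"cloud_a": 0.10, "cloud_b": 0.10}

small = Job(hours=10, data_gb=10, data_cloud="cloud_a", allowed_regions=frozenset({"eu"}))
large = Job(hours=10, data_gb=1000, data_cloud="cloud_a", allowed_regions=frozenset({"eu"}))

print(place(small, offers, egress_per_gb))
print(place(large, offers, egress_per_gb))
```

With these made-up numbers, the job with little data moves to the cheaper `cloud_b` offer in the allowed region, while the job with a large dataset stays on `cloud_a` because the egress charge outweighs the compute savings. The cheapest offer overall (`cloud_b` in `us`) is never chosen because it violates the region constraint.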
To explain our approach in more depth, we:
- first review related concepts and recent developments (§2);
- then describe our vision of Sky Computing and its transformative possibilities (§3);
- present the requirements, architecture, and implementation of an intercloud broker, named SkyPilot, that focuses on computational batch jobs (§4);
- demonstrate its benefits on several applications (§5);
- finally, share our experiences with early deployments (§6), survey related work (§7), and conclude (§8).
While the body of this paper is devoted to the technical characteristics of our system, in the appendix (§A.1) we speculate on how the cloud ecosystem might evolve once Sky Computing is more widely adopted.
SkyPilot is open source and available here.