Deployment Experience

We have deployed SkyPilot to dozens of users from 3 universities and 4 other organizations, who have been using the broker to run both ad hoc and recurring batch jobs in the cloud for many months. These users switched to the intercloud broker from their prior practice of manually interacting with specific clouds, either via web consoles or low-level APIs. Below, we discuss our experiences with the system so far, based on user feedback.

Benefits of an intercloud broker. By surveying our users, we found that they value the broker not only for cost reduction, but also for improved availability (see §5.2) and, more generally, for improved productivity. For example, users like the broker's ability to automatically provision scarce resources across clouds or regions, the easy access to best-of-breed hardware (e.g., TPUs), and the simple packaging of existing programs. Moreover, because they interact with the broker rather than with individual clouds, they value being able to run the same jobs on different clouds with no change to their code or workflow.

Cluster reuse for faster development and debugging. Users have reported that the typical provisioning time of several minutes for a new cluster is too long, especially during the iterative code development phase. To alleviate this, we added the ability to reuse an existing cluster to run a new application. This also helps with debugging Sky applications, since users can log into a cluster to inspect and troubleshoot.

Moving data is acceptable for many workloads. Data gravity can prevent workloads from being moved across clouds. However, we found that for many batch workloads, cross-cloud data transfers are not as slow or costly as we expected. In fact, moving data can be profitable even after factoring in egress costs (Figure 5; Figure 8).

There are several reasons for this. First, the computational complexity of many batch jobs, such as ML training, is typically super-linear in the input size, so compute time and cost grow faster than transfer time and cost as datasets grow. Second, many datasets are not excessively large. For example, a study from Microsoft reports that most production ML datasets are between 1 GB and 1 TB [75]. Our results (§5.1.1) suggest that a 1 TB dataset can likely be moved in ∼20 minutes at a cost of ∼$90. Depending on the job, this delay and cost can be easily offset by the destination offering better hardware, software, or pricing.
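The trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only and is not part of SkyPilot: the function names are hypothetical, the default throughput (6.7 Gbps) and egress price ($0.09 per GB) are assumptions chosen only to reproduce the ∼20 minutes and ∼$90 per TB figures quoted above, and the hourly prices in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass
class TransferEstimate:
    minutes: float
    dollars: float


def estimate_transfer(dataset_gb: float,
                      throughput_gbps: float = 6.7,      # assumed: ~1 TB in ~20 min
                      egress_usd_per_gb: float = 0.09    # assumed: ~$90 per TB
                      ) -> TransferEstimate:
    """Estimate time and egress cost of moving a dataset across clouds."""
    seconds = dataset_gb * 8 / throughput_gbps  # GB -> gigabits, then divide by Gbps
    return TransferEstimate(minutes=seconds / 60,
                            dollars=dataset_gb * egress_usd_per_gb)


def moving_pays_off(dataset_gb: float,
                    src_hours: float, src_usd_per_hour: float,
                    dst_hours: float, dst_usd_per_hour: float) -> bool:
    """Return True if running at the destination is cheaper, egress included."""
    transfer = estimate_transfer(dataset_gb)
    cost_if_staying = src_hours * src_usd_per_hour
    cost_if_moving = dst_hours * dst_usd_per_hour + transfer.dollars
    return cost_if_moving < cost_if_staying


if __name__ == "__main__":
    est = estimate_transfer(1000)  # 1 TB, decimal units
    print(f"1 TB transfer: ~{est.minutes:.0f} min, ~${est.dollars:.0f}")
    # Hypothetical job: 10 hours at $32/hour where the data lives,
    # versus 10 hours at $20/hour on another cloud.
    print(moving_pays_off(1000, 10, 32.0, 10, 20.0))
```

With these made-up prices, moving costs $200 of compute plus roughly $90 of egress, compared with $320 for staying put, so the move is worthwhile. The model deliberately compares dollar cost only; it reports the transfer delay but does not price it.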

On-premise clusters as part of the Sky. Users have requested support for running jobs on on-premise clusters through the broker. This would have several benefits. First, it would enable users to take advantage of idle local clusters and burst to the cloud when those clusters are overloaded. Second, the broker would offer the same interface that hides the heterogeneity (to the extent possible), so the same Sky applications could run both in the cloud and locally. Challenges include designing spillover policies and handling compatibility and storage.
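The text leaves the spillover policy open. Purely to illustrate what such a policy might look like, here is a minimal greedy sketch: fill the local cluster first and send whatever does not fit to the cloud. The `Job` type, the `place_jobs` function, and the use of GPU count as the only resource are all hypothetical simplifications, not SkyPilot's design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    gpus: int  # hypothetical: GPUs are the only resource modeled here


def place_jobs(jobs: List[Job], local_free_gpus: int) -> Dict[str, str]:
    """Greedy spillover: place each job locally if it fits, else in the cloud."""
    placement: Dict[str, str] = {}
    for job in jobs:
        if job.gpus <= local_free_gpus:
            placement[job.name] = "local"
            local_free_gpus -= job.gpus  # reserve local capacity for this job
        else:
            placement[job.name] = "cloud"  # local cluster is full: burst out
    return placement


if __name__ == "__main__":
    queue = [Job("train-a", 4), Job("train-b", 8), Job("eval-c", 2)]
    print(place_jobs(queue, local_free_gpus=8))
```

A real policy would also need to weigh queueing delay, data location, and cloud cost, which is part of why the text lists this as a challenge.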
