QuickStart SkyPilot

official tutorial

Prerequisites

Bash
mkdir hello-sky
cd hello-sky

YAML
# hello-sky/hello_sky.yaml
resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  cloud: aws
  # 8x NVIDIA A100 GPU
  accelerators: A100:8

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "Running setup."

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "Hello, SkyPilot!"
  conda env list
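
The comments in the YAML note that parts of the `resources` section are optional. A minimal sketch of a CPU-only variant follows, assuming SkyPilot falls back to its default resources when the section is omitted entirely. The file name `hello_sky_cpu.yaml` is hypothetical, and the heredoc is just one way to write the file from the shell.

```shell
# Create the project directory if it does not exist yet.
mkdir -p hello-sky

# Write a task file without a `resources` section (hypothetical file name).
# The quoted 'EOF' keeps the shell from expanding anything inside the YAML.
cat > hello-sky/hello_sky_cpu.yaml <<'EOF'
# hello-sky/hello_sky_cpu.yaml
# Assumption: with no resources section, SkyPilot uses its defaults.
workdir: .

setup: |
  echo "Running setup."

run: |
  echo "Hello, SkyPilot!"
  conda env list
EOF
```

From inside `hello-sky`, this variant would be launched the same way as the original file, for example `sky launch -c mycluster hello_sky_cpu.yaml`.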

To launch a cluster and run a task, use sky launch:

Bash
sky launch -c mycluster hello_sky.yaml
sky launch

This operation performs two steps:

  1. Initialize and launch a cluster.
  2. Connect to the cluster and run the task on it.

Bash
$ sky launch -c mycluster hello_sky.yaml
Task from YAML spec: hello_sky.yaml
Considered resources (1 node):
-----------------------------------------------------------------------------------------------
 CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
-----------------------------------------------------------------------------------------------
 Kubernetes   2CPU--2GB   2       2         -              kind-skypilot   0.00          ✔
-----------------------------------------------------------------------------------------------
Launching a new cluster 'mycluster'. Proceed? [Y/n]: y
⚙︎ Launching on Kubernetes.
⠧ Launching  View logs at: ~/sky_logs/sky-2024-10-26-00-32-04-918185/provision.log

The first run may take a few minutes, since it covers initialization, launch, and the task run itself. Just have a cup of coffee :)

Once the cluster finishes provisioning, the task is executed. The output shows Hello, SkyPilot! followed by the list of installed Conda environments.
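
If the terminal was closed before the output appeared, the job queue and logs can be checked afterwards. The sketch below wraps `sky queue` and `sky logs` in a helper. The function name `show_job` is hypothetical, and defaulting to job ID 1 assumes the task was the first job submitted to a fresh cluster.

```shell
# Hypothetical helper: list the jobs on a cluster, then stream one job's output.
# usage: show_job CLUSTER [JOB_ID]
show_job() {
  local cluster="$1"
  local job_id="${2:-1}"   # assumption: the first job on a fresh cluster has ID 1

  sky queue "$cluster"            # show the cluster's jobs and their statuses
  sky logs "$cluster" "$job_id"   # stream the output of the chosen job
}
```

For the cluster above this would be called as `show_job mycluster` or, for a later job, `show_job mycluster 2`.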

A Big Problem

here

How to Use

Execute a Task on an Existing Cluster

Bash
sky exec mycluster hello_sky.yaml
sky exec mycluster python train_cpu.py
sky exec mycluster --gpus=A100:8 python train_gpu.py

This connects to the existing cluster and runs a task on it, without launching a new one.

View All Clusters

Bash
sky status

All available arguments: here

Remote Connection

SSH Log In

Bash
ssh mycluster

Transfer Files

Bash
# After a task’s execution, use rsync or scp to download files
rsync -Pavz mycluster:/remote/source /local/dest  # copy from remote VM
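
The same rsync invocation works in the other direction by swapping source and destination. The two helpers below are a sketch of both directions. The function names are hypothetical, and the paths passed to them are placeholders just like `/remote/source` and `/local/dest` above.

```shell
# Hypothetical helpers around rsync; the cluster name doubles as the SSH alias.
# -P shows progress and keeps partial files, -a preserves file attributes,
# -v is verbose, -z compresses data in transit.

# usage: pull_from_cluster CLUSTER REMOTE_PATH LOCAL_PATH
pull_from_cluster() {
  rsync -Pavz "$1:$2" "$3"
}

# usage: push_to_cluster CLUSTER LOCAL_PATH REMOTE_PATH
push_to_cluster() {
  rsync -Pavz "$2" "$1:$3"
}
```

For example, `push_to_cluster mycluster /local/source /remote/dest` uploads a local directory to the cluster before a task runs.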

Stop/Terminate Cluster

When you are done, stop the cluster with sky stop:

Bash
sky stop mycluster

To terminate a cluster instead, run sky down:

Bash
sky down mycluster

Difference

Stopping a cluster keeps the data on the attached disks: billing for the instances stops, but the disks are still charged. The disks are reattached when the cluster is restarted.

Terminating a cluster will delete all associated resources (all billing stops), and any data on the attached disks will be lost. Terminated clusters cannot be restarted.
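
To make the stop/terminate distinction harder to mix up, the commands can be put behind named actions. This is a sketch only: the wrapper `cluster_lifecycle` and its action names are hypothetical, and it assumes `sky start` is the command that restarts a stopped cluster.

```shell
# Hypothetical wrapper mapping intent to the matching sky command.
# usage: cluster_lifecycle pause|resume|delete CLUSTER
cluster_lifecycle() {
  local action="$1"
  local cluster="$2"

  case "$action" in
    pause)  sky stop "$cluster" ;;    # disks are kept; instance billing stops
    resume) sky start "$cluster" ;;   # restart a previously stopped cluster
    delete) sky down "$cluster" ;;    # all resources and disk data are removed
    *)
      echo "usage: cluster_lifecycle pause|resume|delete CLUSTER" >&2
      return 2
      ;;
  esac
}
```

`cluster_lifecycle pause mycluster` followed later by `cluster_lifecycle resume mycluster` keeps the disk contents, while `cluster_lifecycle delete mycluster` cannot be undone.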