
Log Implementation

Week 1

Origins of Offering Log


Code-Based Observation

sky/cli.py

sky/cli.py documents three kinds of sky serve logs commands:

Bash
# sky/cli.py

# Tail the controller logs of a service
sky serve logs --controller [SERVICE_NAME]
# Print the load balancer logs so far and exit
sky serve logs --load-balancer --no-follow [SERVICE_NAME]
# Tail the logs of replica 1
sky serve logs [SERVICE_NAME] 1

These commands are handled by serve_logs():

Python
# sky/cli.py

def serve_logs(
    service_name: str,
    follow: bool,
    controller: bool,
    load_balancer: bool,
    replica_id: Optional[int],
):
    ......
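
To make the mapping from flags to a log target concrete, here is a minimal sketch. It assumes the three options are mutually exclusive and that a LOAD_BALANCER member exists next to the CONTROLLER and REPLICA members that show up later in this walkthrough. `resolve_target` is a hypothetical helper written for illustration, not a SkyPilot function.

```python
import enum
from typing import Optional


class ServiceComponent(enum.Enum):
    """Stand-in for serve_utils.ServiceComponent (member values are assumed)."""
    CONTROLLER = 'controller'
    LOAD_BALANCER = 'load_balancer'
    REPLICA = 'replica'


def resolve_target(controller: bool, load_balancer: bool,
                   replica_id: Optional[int]) -> ServiceComponent:
    """Hypothetical helper: pick exactly one log target from the CLI flags."""
    chosen = [controller, load_balancer, replica_id is not None]
    if sum(chosen) != 1:
        raise ValueError('Specify exactly one of --controller, '
                         '--load-balancer, or a replica ID.')
    if controller:
        return ServiceComponent.CONTROLLER
    if load_balancer:
        return ServiceComponent.LOAD_BALANCER
    return ServiceComponent.REPLICA
```

With this sketch, `sky serve logs [SERVICE_NAME] 1` would correspond to `resolve_target(False, False, 1)`, which selects the replica target.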

Inside serve_logs(), the actual work is delegated to serve_lib.tail_logs():

Python
# sky/cli.py

serve_lib.tail_logs(service_name,
                    target=target_component,
                    replica_id=replica_id,
                    follow=follow)

Hence, we jump to:

Python
# sky/serve/core.py

@usage_lib.entrypoint
def tail_logs(
    service_name: str,
    *,
    target: Union[str, serve_utils.ServiceComponent],
    replica_id: Optional[int] = None,
    follow: bool = True,
) -> None:

    ......
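
Two details of this signature are easy to miss: the bare `*` makes every argument after service_name keyword-only, and target accepts either a string or an enum member. The sketch below illustrates both under stated assumptions. `Component` and `tail_logs_sketch` are hypothetical names, and the validation shown is only a guess at what the elided body might need to do.

```python
import enum
from typing import Optional, Union


class Component(enum.Enum):
    """Illustrative enum; not SkyPilot's ServiceComponent."""
    CONTROLLER = 'controller'
    REPLICA = 'replica'


def tail_logs_sketch(service_name: str,
                     *,
                     target: Union[str, Component],
                     replica_id: Optional[int] = None,
                     follow: bool = True) -> str:
    """Hypothetical stand-in: validate arguments and describe the request."""
    if isinstance(target, str):
        # Enum lookup by value raises ValueError for unknown names.
        target = Component(target)
    if target is Component.REPLICA and replica_id is None:
        raise ValueError('replica_id is required when target is replica.')
    suffix = f' #{replica_id}' if target is Component.REPLICA else ''
    mode = 'follow' if follow else 'no-follow'
    return f'{service_name}: {target.value}{suffix} ({mode})'
```

Calling `tail_logs_sketch('vicuna', 'controller')` fails with a TypeError because target must be passed by keyword, which matches how the call site in sky/cli.py passes `target=target_component`.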

sky/serve/core.py

Inside tail_logs(), the call that matters for serve logging is backend.tail_serve_logs():

Python
# sky/serve/core.py

backend = backend_utils.get_backend_from_handle(handle)
assert isinstance(backend, backends.CloudVmRayBackend), backend
backend.tail_serve_logs(handle,
                        service_name,
                        target,
                        replica_id,
                        follow=follow)

Then we come to:

Python
# sky/backends/cloud_vm_ray_backend.py

def tail_serve_logs(self, handle: CloudVmRayResourceHandle,
                    service_name: str, target: serve_lib.ServiceComponent,
                    replica_id: Optional[int], follow: bool) -> None:
    """Tail the logs of a service.

    Args:
        handle: The handle to the sky serve controller.
        service_name: The name of the service.
        target: The component to tail the logs of. Could be controller,
            load balancer, or replica.
        replica_id: The replica ID to tail the logs of. Only used when
            target is replica.
        follow: Whether to follow the logs.
    """

sky/backends/cloud_vm_ray_backend.py

Here we see:

Python
# sky/backends/cloud_vm_ray_backend.py

if target != serve_lib.ServiceComponent.REPLICA:
    code = serve_lib.ServeCodeGen.stream_serve_process_logs(
        service_name,
        stream_controller=(
            target == serve_lib.ServiceComponent.CONTROLLER),
        follow=follow)
else:
    assert replica_id is not None, service_name
    code = serve_lib.ServeCodeGen.stream_replica_logs(
        service_name, replica_id, follow)

Next we trace the two code-generation functions, serve_lib.ServeCodeGen.stream_serve_process_logs() and serve_lib.ServeCodeGen.stream_replica_logs() (details omitted for brevity).

Neither of them runs anything: each one only builds a shell command and returns it as a string.
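
The general "generate code as a string, run it elsewhere" pattern can be sketched as follows. This is not the ServeCodeGen implementation, which is omitted above. `build_remote_command` and `stream_replica_logs_sketch` are hypothetical names, and the generated program just prints its arguments instead of streaming any logs.

```python
import shlex
import sys
from typing import List


def build_remote_command(code_lines: List[str]) -> str:
    """Hypothetical code generator: wrap Python statements in one shell command."""
    program = '; '.join(code_lines)
    # shlex.quote keeps the whole program a single shell argument.
    return f'{shlex.quote(sys.executable)} -u -c {shlex.quote(program)}'


def stream_replica_logs_sketch(service_name: str, replica_id: int,
                               follow: bool) -> str:
    """Return a command string; nothing is executed here."""
    return build_remote_command([
        f'print({service_name!r}, {replica_id!r}, {follow!r})',
    ])
```

The point of the design is that the caller gets back plain text, so the same string can be handed to whatever mechanism runs commands on the remote machine.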

tail_serve_logs() then passes the generated command string to run_on_head() for execution:

Python
# sky/backends/cloud_vm_ray_backend.py

self.run_on_head(
    handle,
    code,
    stream_logs=True,
    process_stream=False,
    ssh_mode=command_runner.SshMode.INTERACTIVE,
    stdin=subprocess.DEVNULL,
)
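
To get a feel for what "run a command string and stream its output" means, here is a local stand-in built on subprocess. It is only a sketch: the real call above goes over SSH and passes process_stream=False, whereas this hypothetical `run_and_stream` runs on the local machine and also collects the lines so that the behaviour can be checked.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_stream(cmd: str,
                   stream_logs: bool = True) -> Tuple[int, List[str]]:
    """Run a shell command locally, echoing each output line as it arrives."""
    lines: List[str] = []
    with subprocess.Popen(cmd,
                          shell=True,
                          stdin=subprocess.DEVNULL,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            if stream_logs:
                sys.stdout.write(line)
                sys.stdout.flush()
            lines.append(line.rstrip('\n'))
    # Leaving the with-block waits for the process, so returncode is set.
    return proc.returncode, lines
```

As in the call above, stdin is attached to DEVNULL so the child process never waits for keyboard input.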

So we move on to the definition of self.run_on_head():

Python
# sky/backends/cloud_vm_ray_backend.py

@timeline.event
def run_on_head(
    self,
    handle: CloudVmRayResourceHandle,
    cmd: str,
    *,
    port_forward: Optional[List[int]] = None,
    log_path: str = '/dev/null',
    stream_logs: bool = False,
    ssh_mode: command_runner.SshMode = (
        command_runner.SshMode.NON_INTERACTIVE),
    under_remote_workdir: bool = False,
    require_outputs: bool = False,
    separate_stderr: bool = False,
    process_stream: bool = True,
    source_bashrc: bool = False,
    **kwargs,
) -> Union[int, Tuple[int, str, str]]:
    ......

Inside run_on_head(), the command is finally executed by head_runner.run():

Python
# sky/backends/cloud_vm_ray_backend.py

return head_runner.run(
    cmd,
    port_forward=port_forward,
    log_path=log_path,
    process_stream=process_stream,
    stream_logs=stream_logs,
    ssh_mode=ssh_mode,
    require_outputs=require_outputs,
    separate_stderr=separate_stderr,
    source_bashrc=source_bashrc,
    **kwargs,
)

Searching the codebase for head_runner.run and following the .run definition, we land in:

Python
# sky/utils/command_runner.py

class CommandRunner:

    ......

    @timeline.event
    def run(
            self,
            cmd: Union[str, List[str]],
            *,
            require_outputs: bool = False,
            # Advanced options.
            log_path: str = os.devnull,
            # If False, do not redirect stdout/stderr to optimize performance.
            process_stream: bool = True,
            stream_logs: bool = True,
            ssh_mode: SshMode = SshMode.NON_INTERACTIVE,
            separate_stderr: bool = False,
            connect_timeout: Optional[int] = None,
            source_bashrc: bool = False,
            skip_lines: int = 0,
            **kwargs) -> Union[int, Tuple[int, str, str]]:
        """Runs the command on the cluster.

        Args:
            cmd: The command to run.
            require_outputs: Whether to return the stdout/stderr of the command.
            log_path: Redirect stdout/stderr to the log_path.
            stream_logs: Stream logs to the stdout/stderr.
            ssh_mode: The mode to use for ssh.
                See SSHMode for more details.
            separate_stderr: Whether to separate stderr from stdout.
            connect_timeout: timeout in seconds for the ssh connection.
            source_bashrc: Whether to source the ~/.bashrc before running the
                command.
            skip_lines: The number of lines to skip at the beginning of the
                output. This is used when the output is not processed by
                SkyPilot but we still want to get rid of some warning messages,
                such as SSH warnings.


        Returns:
            returncode
            or
            A tuple of (returncode, stdout, stderr).
        """
        raise NotImplementedError

    ......

sky/utils/command_runner.py

Here CommandRunner.run() only declares the signature and docstring; its body just raises NotImplementedError. 👀

So maybe we should finish this part?
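
Before concluding that something is missing, note that in Python a method whose body is `raise NotImplementedError` is commonly a base-class placeholder that subclasses are expected to override. Whether that is what happens in sky/utils/command_runner.py has not been checked in these notes, so the sketch below only shows the general pattern. `BaseRunner`, `LocalRunner`, and `run_on_head_sketch` are hypothetical names.

```python
import subprocess
from typing import List, Union


class BaseRunner:
    """Hypothetical base class: declares the interface only."""

    def run(self, cmd: Union[str, List[str]]) -> int:
        raise NotImplementedError


class LocalRunner(BaseRunner):
    """Hypothetical subclass that supplies the real behaviour."""

    def run(self, cmd: Union[str, List[str]]) -> int:
        if isinstance(cmd, list):
            cmd = ' && '.join(cmd)
        return subprocess.run(cmd, shell=True, check=False).returncode


def run_on_head_sketch(runner: BaseRunner, cmd: str) -> int:
    # The caller only knows the base type; Python dispatches to the
    # subclass's run() at call time.
    return runner.run(cmd)
```

If the same pattern applies here, the object bound to head_runner would be an instance of some subclass, and the `raise NotImplementedError` line would never be reached in normal use.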

TLDR

Origins of Offering Log

From issues #2914, #2917, #2949, and #3063, I now understand the overall design proposed for serve logs.

#2914 and #3063 are still open, so maybe we should work on them?

Problems about cmd running

The SkyPilot website lists the following sky serve logs commands:

Bash
sky serve logs vicuna 1 # tail logs of replica 1, including provisioning and running logs
sky serve logs vicuna --controller # tail controller logs
sky serve logs vicuna --load-balancer --no-follow # print the load balancer logs so far, and exit
Bash
# sky/cli.py

# Tail the controller logs of a service
sky serve logs --controller [SERVICE_NAME]
# Print the load balancer logs so far and exit
sky serve logs --load-balancer --no-follow [SERVICE_NAME]
# Tail the logs of replica 1
sky serve logs [SERVICE_NAME] 1

As far as I can tell, they all end up in run() in sky/utils/command_runner.py (details omitted for brevity):

Python
def run(self,
        cmd: Union[str, List[str]],
        *,
        require_outputs: bool = False,
        # Advanced options.
        log_path: str = os.devnull,
        # If False, do not redirect stdout/stderr to optimize performance.
        process_stream: bool = True,
        stream_logs: bool = True,
        ssh_mode: SshMode = SshMode.NON_INTERACTIVE,
        separate_stderr: bool = False,
        connect_timeout: Optional[int] = None,
        source_bashrc: bool = False,
        skip_lines: int = 0,
        **kwargs) -> Union[int, Tuple[int, str, str]]:
    '''......'''
    raise NotImplementedError

But this run() only raises NotImplementedError. If it is not implemented here, how do the corresponding CLI commands actually get executed?
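
One way to investigate this kind of question empirically is to ask a live object which class actually supplies its run(). The helper below walks the method resolution order. `defining_class` and the three toy classes are hypothetical, and nothing here has been run against SkyPilot itself.

```python
from typing import Optional


def defining_class(obj: object, method_name: str) -> Optional[type]:
    """Return the first class in obj's MRO that defines method_name itself."""
    for cls in type(obj).__mro__:
        if method_name in vars(cls):
            return cls
    return None


class Base:

    def run(self) -> int:
        raise NotImplementedError


class Concrete(Base):

    def run(self) -> int:
        return 0


class Inherits(Base):
    pass
```

In a debugger stopped at the head_runner.run(...) call shown earlier, evaluating `defining_class(head_runner, 'run')` would reveal whether the method comes from CommandRunner or from some other class.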