Lecture 13 Ray: A universal framework for distributed computing

Intro

Trends

  • Apps increasingly incorporate AI
  • AI demands exploding
    • Not only compute, but memory
  • The end of Moore's Law
  • Specialized hardware is not enough
    • Growing gap between memory demand and supply

No way out but to distribute these workloads

Challenges

Challenge: need to scale every stage

The current situation: each stage has numerous supporting systems, but the systems for different stages are not designed to be compatible with one another.

So the challenge is:

  • Hard to develop
  • Hard to deploy
  • Hard to manage
  • Slow, as data between stages is transferred via distributed storage systems
  • "End to end" failure semantics (API level)
  • Reliability semantics (API level)

In this scenario, we offer Ray as a universal framework for distributed computing.

Ray

One system to support all these workloads

Minimalist API

Here we show the core part of the Ray API. In fact, these few primitives can cover 95% of your needs in regular development.
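
A minimal sketch of these primitives in use, assuming a local Ray installation:

```python
import ray

ray.init()                     # start or connect to a Ray cluster

@ray.remote                    # turn an ordinary function into a remote task
def square(x):
    return x * x

ref = square.remote(4)         # returns a future (ObjectRef) immediately
print(ray.get(ref))            # block until the result is ready -> 16

data_ref = ray.put([1, 2, 3])  # place an object into the shared object store
ready, pending = ray.wait([ref, data_ref], num_returns=1)  # wait for any one
```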

FAST Programming Model

The core components of Ray can be summarized as:

  • Futures: references to objects (possibly not created yet)
  • Actors: remote objects (class instances); see the sketch below
  • Shared in-memory object store: holds immutable objects that tasks and actors on the same node can read without copying
  • Tasks: remote functions
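
Tasks and futures appear in the sketch above; here is a minimal sketch of an actor (the Counter class is hypothetical, for illustration):

```python
import ray

ray.init()

@ray.remote
class Counter:                 # an actor: a class instance in its own remote process
    def __init__(self):
        self.value = 0

    def incr(self):
        self.value += 1
        return self.value

counter = Counter.remote()                        # instantiate the actor remotely
futs = [counter.incr.remote() for _ in range(3)]  # method calls return futures
print(ray.get(futs))                              # -> [1, 2, 3]
```

Method calls on a single actor execute one at a time, so the counter's state updates serially.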

Usage

Plain Python example
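
A sketch consistent with the timing below, assuming two 1-second function calls:

```python
import time

def f():
    time.sleep(1)              # simulate 1 second of work
    return 1

start = time.time()
results = [f(), f()]           # the two calls run one after the other
print(time.time() - start)     # ~2 seconds
```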

Total running time: 2 seconds.

In Ray
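
A sketch of the same program in Ray, under the same assumptions; the two tasks now run in parallel:

```python
import time
import ray

ray.init()

@ray.remote
def f():
    time.sleep(1)                # the same 1 second of work, now in a remote task
    return 1

start = time.time()
futs = [f.remote(), f.remote()]  # both tasks are launched immediately
results = ray.get(futs)          # block until both finish
print(time.time() - start)       # ~1 second: the tasks overlapped
```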

Total running time: 1 second.

Similar to Python, easy to start

As the examples above show, moving from plain Python to Ray only requires adding the @ray.remote decorator, invoking functions with .remote(), and fetching results with ray.get().

Distributed object store in Ray

In Ray, "pass-by-reference" is utilized instead of "pass-by-value".


Comparison: Traditional RPC

With traditional RPC, a result is copied back to the caller and then copied again to the next callee. With Ray, only the small object reference flows through the caller, while the data itself stays in the object store and moves directly between workers.

Ray Ecosystem

On top of the core API, Ray hosts a growing ecosystem of libraries, such as Ray Tune (hyperparameter search), RLlib (reinforcement learning), Ray Train, Ray Serve, and Ray Data.

Building 2nd generation ML infra/platform

Why?

  • 1st generation was built by stitching together a bunch of systems and tools → difficult to develop, maintain, and evolve
  • Ray promises to address these limitations

Original Design

Ray Architecture

  • In-memory object store -> immutable objects
  • Distributed scheduler
  • Central control store (GCS)


GCS

The GCS is used to store metadata (tables), but not the data itself.

E.g., object locations, reference relationships, ...

Scalability

  • Decentralized scheduler
  • Sharded GCS
  • Any worker can submit tasks (see the sketch below)
    • Driver is not a bottleneck
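
A sketch of this (illustrative function names): a task can itself submit tasks, so fan-out happens on the workers rather than on the driver.

```python
import ray

ray.init()

@ray.remote
def leaf(x):
    return x * x

@ray.remote
def branch(xs):
    # This runs on a worker, and the worker itself submits the leaf tasks;
    # the driver never sees them.
    return sum(ray.get([leaf.remote(x) for x in xs]))

# The driver submits only two tasks; the workers fan out the rest.
print(ray.get([branch.remote(list(range(3))), branch.remote(list(range(3, 6)))]))
```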

Fault Tolerance

Lineage-based: replay the computation to reconstruct lost objects.

(1) Scenario

To be specific, suppose a job consists of subtasks 1, 2, 3, and 4.

Task 1 is assigned to the 1st worker, task 2 to the 2nd worker, task 3 to the 3rd worker, and task 4 to the 4th worker.

These workers run "simultaneously".

If task 2 fails, what should we do? (Tasks 3 and 4 consume the future produced by task 2.)

(2) How to solve

In fact, when we submit this job, the dependency relationships are automatically recorded in the GCS: 1 -> 2, 2 -> 3, 3 -> 4.

When task 2 fails, we can simply go to the direct "boss" of 2, which is 3.

  1. We tell 3: "Your child 2 is dead, so you need to 'save' him."
  2. 3 replies: "Copy that! I will save him now."
  3. 3 tells 2: "2, you are dead, please redo your work!"
  4. Task 2 re-executes.
  5. We are good now.
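
A toy, in-process sketch of the idea (not Ray's actual implementation): a lineage table records, for each object, the task that produced it and that task's inputs, so a lost object can be rebuilt by replaying tasks recursively.

```python
lineage = {}   # object_id -> (task_fn, input_object_ids)
store = {}     # object_id -> value (a stand-in for the object store)

def submit(object_id, task_fn, input_ids=()):
    lineage[object_id] = (task_fn, input_ids)        # record lineage (the "GCS")
    store[object_id] = task_fn(*(get(i) for i in input_ids))

def get(object_id):
    if object_id not in store:                       # object lost: replay its task
        task_fn, input_ids = lineage[object_id]
        store[object_id] = task_fn(*(get(i) for i in input_ids))
    return store[object_id]

# Chain 1 -> 2 -> 3 -> 4, then simulate losing object 2.
submit(1, lambda: 10)
submit(2, lambda x: x + 1, (1,))
submit(3, lambda x: x * 2, (2,))
submit(4, lambda x: x - 3, (3,))
del store[2]          # "worker 2 fails" and its output is lost
print(get(2))         # replayed from object 1 -> 11
```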

Ownership

Based on the example above, we can understand the "who is responsible for whom" relationship.

Ray gives this relationship a name: "ownership". The worker that creates an object reference is its owner, and the owner is responsible for the metadata and recovery of the corresponding object.

Ephemeral resources


Summary

  1. Distributed (AI) apps are becoming the norm
  2. Building distributed apps is extremely hard
  3. Ray is a universal framework that dramatically simplifies distributed computing