Lecture 13 Ray: A universal framework for distributed computing

Intro

Trends

  • Apps increasingly incorporate AI
  • AI demands exploding
    • Not only compute, but memory
  • The end of Moore's Law
  • Specialized hardware is not enough
    • Growing gap between memory demand and supply

No way out but to distribute these workloads

Challenges

Challenge: need to scale every stage

The current situation: each stage has numerous supporting systems, but the systems for different stages are not designed to be compatible with one another.

So the challenge is:

  • Hard to develop
  • Hard to deploy
  • Hard to manage
  • Slow, as data between stages is transferred via distributed storage systems
  • "End to end" failure semantics (API level)
  • Reliability semantics (API level)

In this scenario, we offer Ray as a universal framework for distributed computing.

Ray

One system to support all these workloads

Minimalist API

Here we show the core part of the Ray API. In fact, these few primitives can cover 95% of your needs in regular development.
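
A minimal sketch of these primitives in use, assuming a local Ray installation:

```python
import ray

ray.init()                     # start or connect to a Ray cluster

@ray.remote                    # turn an ordinary function into a remote task
def square(x):
    return x * x

ref = square.remote(4)         # returns a future (ObjectRef) immediately
print(ray.get(ref))            # block until the result is ready -> 16

data_ref = ray.put([1, 2, 3])  # place an object into the shared object store
ready, pending = ray.wait([ref, data_ref], num_returns=1)  # wait for any one
```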

FAST Programming Model

The core components of Ray can be summarized as:

  • Futures: references to objects (possibly not created yet)
  • Actors: remote objects (class instances); see the sketch below
  • Shared in-memory object store: holds immutable objects that tasks and actors on the same node can read without copying
  • Tasks: remote functions
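
Tasks and futures appear in the sketch above; here is a minimal sketch of an actor (the Counter class is hypothetical, for illustration):

```python
import ray

ray.init()

@ray.remote
class Counter:                 # an actor: a class instance in its own remote process
    def __init__(self):
        self.value = 0

    def incr(self):
        self.value += 1
        return self.value

counter = Counter.remote()                        # instantiate the actor remotely
futs = [counter.incr.remote() for _ in range(3)]  # method calls return futures
print(ray.get(futs))                              # -> [1, 2, 3]
```

Method calls on a single actor execute one at a time, so the counter's state updates serially.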

Usage

Plain Python example
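
A sketch consistent with the timing below, assuming two 1-second function calls:

```python
import time

def f():
    time.sleep(1)              # simulate 1 second of work
    return 1

start = time.time()
results = [f(), f()]           # the two calls run one after the other
print(time.time() - start)     # ~2 seconds
```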

Total running time: 2 seconds.

In Ray
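
A sketch of the same program in Ray, under the same assumptions; the two tasks now run in parallel:

```python
import time
import ray

ray.init()

@ray.remote
def f():
    time.sleep(1)                # the same 1 second of work, now in a remote task
    return 1

start = time.time()
futs = [f.remote(), f.remote()]  # both tasks are launched immediately
results = ray.get(futs)          # block until both finish
print(time.time() - start)       # ~1 second: the tasks overlapped
```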

Total running time: 1 second.

Similar to Python, easy to start

As the examples above show, moving from plain Python to Ray only requires adding the @ray.remote decorator, invoking functions with .remote(), and fetching results with ray.get().

Distributed object store in Ray

In Ray, "pass-by-reference" is utilized instead of "pass-by-value".


Comparison: Traditional RPC

With traditional RPC, a result is copied back to the caller and then copied again to the next callee. With Ray, only the small object reference flows through the caller, while the data itself stays in the object store and moves directly between workers.

Ray Ecosystem

On top of the core API, Ray hosts a growing ecosystem of libraries, such as Ray Tune (hyperparameter search), RLlib (reinforcement learning), Ray Train, Ray Serve, and Ray Data.

Building 2nd generation ML infra/platform

Why?

  • 1st generation was built by stitching together a bunch of systems and tools → difficult to develop, maintain, and evolve
  • Ray promises to address these limitations

Original Design

Ray Architecture

  • In-memory object store -> immutable objects
  • Distributed scheduler
  • Central control store (GCS)


GCS

The GCS is used to store metadata (tables), but not the data itself.

E.g., object locations, reference relationships, ...

Scalability

  • Decentralized scheduler
  • Sharded GCS
  • Any worker can submit tasks (see the sketch below)
    • Driver is not a bottleneck
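
A sketch of this (illustrative function names): a task can itself submit tasks, so fan-out happens on the workers rather than on the driver.

```python
import ray

ray.init()

@ray.remote
def leaf(x):
    return x * x

@ray.remote
def branch(xs):
    # This runs on a worker, and the worker itself submits the leaf tasks;
    # the driver never sees them.
    return sum(ray.get([leaf.remote(x) for x in xs]))

# The driver submits only two tasks; the workers fan out the rest.
print(ray.get([branch.remote(list(range(3))), branch.remote(list(range(3, 6)))]))
```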

Fault Tolerance

Lineage-based: replay the computation to reconstruct lost objects.

(1) Scenario

To be specific, suppose a job consists of subtasks 1, 2, 3, and 4.

Task 1 is assigned to the 1st worker, task 2 to the 2nd worker, task 3 to the 3rd worker, and task 4 to the 4th worker.

These workers run "simultaneously".

If task 2 fails, what should we do? (Tasks 3 and 4 consume the future produced by task 2.)

(2) How to solve

In fact, when we submit this job, the dependency relationships are automatically recorded in the GCS: 1 -> 2, 2 -> 3, 3 -> 4.

When task 2 fails, we can simply go to the direct "boss" of 2, which is 3.

  1. We tell 3: "Your child 2 is dead, so you need to 'save' him."
  2. 3 replies: "Copy that! I will save him now."
  3. 3 tells 2: "2, you are dead, please redo your work!"
  4. Task 2 re-executes.
  5. We are good now.
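
A toy, in-process sketch of the idea (not Ray's actual implementation): a lineage table records, for each object, the task that produced it and that task's inputs, so a lost object can be rebuilt by replaying tasks recursively.

```python
lineage = {}   # object_id -> (task_fn, input_object_ids)
store = {}     # object_id -> value (a stand-in for the object store)

def submit(object_id, task_fn, input_ids=()):
    lineage[object_id] = (task_fn, input_ids)        # record lineage (the "GCS")
    store[object_id] = task_fn(*(get(i) for i in input_ids))

def get(object_id):
    if object_id not in store:                       # object lost: replay its task
        task_fn, input_ids = lineage[object_id]
        store[object_id] = task_fn(*(get(i) for i in input_ids))
    return store[object_id]

# Chain 1 -> 2 -> 3 -> 4, then simulate losing object 2.
submit(1, lambda: 10)
submit(2, lambda x: x + 1, (1,))
submit(3, lambda x: x * 2, (2,))
submit(4, lambda x: x - 3, (3,))
del store[2]          # "worker 2 fails" and its output is lost
print(get(2))         # replayed from object 1 -> 11
```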

Ownership

Based on the example above, we can understand the "who is responsible for whom" relationship.

Ray gives this relationship a name: "ownership". The worker that creates an object reference is its owner, and the owner is responsible for the metadata and recovery of the corresponding object.

Ephemeral resources


Summary

  1. Distributed (AI) apps are becoming the norm
  2. Building distributed apps is extremely hard
  3. Ray is a universal framework that dramatically simplifies distributed computing