Lecture 13 Ray: A universal framework for distributed computing¶
Intro¶
Trends
- Apps increasingly incorporate AI
- AI demands exploding
- Not only compute, but memory
- The end of Moore's Law
- Specialized models not enough
- Growing gap between memory demand and supply
No way out but to distribute these workloads
Challenges
Challenge: need to scale every stage
Currently, each stage has numerous supporting systems, but the systems for different stages are not designed to be compatible with each other.
As a result:
- Hard to develop
- Hard to deploy
- Hard to manage
- Slow, because data moves between stages through distributed storage systems
- "End to end" failure semantics (API level)
- Reliability semantics (API level)
In this scenario, we offer Ray as a universal framework for distributed computing.
Ray¶
One system to support all these workloads
Minimalist API¶
The core of the Ray API is small. In practice, it covers about 95% of what you need in regular development.
FAST Programming Model¶
The core components of Ray can be summarized by four terms whose initials spell FAST:
- Futures: references to objects (possibly not created yet)
- Actors: remote objects (class instances)
- Shared in-memory object store: ...
- Tasks: remote functions
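The roles of tasks, futures, and actors can be illustrated on a single machine. The sketch below is only an analogy built on Python's standard `concurrent.futures`, not Ray's implementation: `pool.submit` stands in for launching a task, the returned `Future` for an object reference, and the hypothetical `CounterActor` class for an actor whose method calls run one at a time. In Ray itself the corresponding calls are `@ray.remote`, `.remote()`, and `ray.get()`.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """A 'task': a stateless function that can run on any worker."""
    return x * x


class CounterActor:
    """A toy 'actor': stateful, with method calls serialized on one thread.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        self._count = 0
        # A single worker thread guarantees calls execute one at a time.
        self._mailbox = ThreadPoolExecutor(max_workers=1)

    def _increment(self, amount):
        self._count += amount
        return self._count

    def increment(self, amount=1):
        # Returns a future immediately; the state change happens later.
        return self._mailbox.submit(self._increment, amount)

    def shutdown(self):
        self._mailbox.shutdown(wait=True)


def demo():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submitting a task returns a future before the result exists.
        futures = [pool.submit(square, i) for i in range(4)]
        squares = [f.result() for f in futures]  # block until ready

    counter = CounterActor()
    replies = [counter.increment(s) for s in squares]
    total = replies[-1].result()  # state after all increments
    counter.shutdown()
    return squares, total
```

Tasks are stateless and can run anywhere, while the actor keeps its counter between calls. Both hand back a future right away.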
Usage¶
Plain Python version
Elapsed time: 2 s.
Ray version
Elapsed time: 1 s.
The Ray version looks much like ordinary Python, so it is easy to get started.
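The timing difference can be reproduced with a standard-library stand-in. The sketch below assumes two independent calls that each take `delay` seconds. A thread pool plays the role of Ray's workers, and the function names are hypothetical. With Ray, the parallel version would instead decorate the function with `@ray.remote`, call it with `.remote()`, and collect the results with `ray.get()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(value, delay):
    """Pretend to do `delay` seconds of work, then return the input."""
    time.sleep(delay)
    return value


def run_sequential(delay=1.0):
    """Run two calls one after the other; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [slow_task(0, delay), slow_task(1, delay)]
    return results, time.perf_counter() - start


def run_parallel(delay=1.0):
    """Run the same two calls concurrently; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both submissions return futures immediately and overlap in time.
        futures = [pool.submit(slow_task, i, delay) for i in range(2)]
        results = [f.result() for f in futures]
    return results, time.perf_counter() - start
```

With the default one-second delay, `run_sequential()` takes about 2 s and `run_parallel()` about 1 s, because the two sleeps overlap.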
Distributed object store in Ray¶
Ray passes objects between tasks by reference rather than by value.
Comparison: Traditional RPC
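A toy comparison makes the difference concrete. In the sketch below, `ObjectStore`, `rpc_style`, and `reference_style` are hypothetical names, and everything runs in one process. The point is only to show how much data the caller has to handle in each style, not how Ray's object store is implemented.

```python
import uuid


class ObjectStore:
    """Toy in-memory object store: maps reference IDs to values.

    Hypothetical stand-in for illustration, not Ray's object store.
    """

    def __init__(self):
        self._objects = {}

    def put(self, value):
        ref = uuid.uuid4().hex  # small handle that identifies the value
        self._objects[ref] = value
        return ref

    def get(self, ref):
        return self._objects[ref]


def produce(n):
    """Worker A: create a payload of n zero bytes (a 'large' result)."""
    return bytes(n)


def rpc_style(n):
    """Pass-by-value: the whole payload travels through the caller."""
    payload = produce(n)             # worker A returns the full payload
    bytes_via_caller = len(payload)  # the caller holds all of it
    consumed = len(payload)          # the caller forwards it to worker B
    return consumed, bytes_via_caller


def reference_style(n, store):
    """Pass-by-reference: only a small handle travels through the caller."""
    ref = store.put(produce(n))      # worker A stores the result
    bytes_via_caller = len(ref)      # the caller only sees the handle
    consumed = len(store.get(ref))   # worker B resolves the handle
    return consumed, bytes_via_caller
```

For a one-megabyte payload, the caller in `rpc_style` handles the full megabyte, while in `reference_style` it handles only a 32-character handle.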
Ray Ecosystem¶
Building 2nd generation ML infra/platform
Why?
- 1st generation was built by stitching together a bunch of systems and tools → difficult to develop, maintain, and evolve
- Ray (promises to) address these limitations
Original Design¶
Ray Architecture¶
- In-memory object store (objects are immutable)
- Distributed scheduler
- Global Control Store (GCS), a central control store
GCS
The GCS stores metadata (as tables), not the data itself.
E.g., where each object is located, and the reference relationships between objects.
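As a rough mental model, the metadata can be pictured as a couple of small tables. The sketch below is a hypothetical in-process structure. The class name, field names, and methods are assumptions for illustration and do not reflect Ray's actual GCS schema.

```python
from dataclasses import dataclass, field


@dataclass
class ControlStore:
    """Toy metadata store: records where objects live and how they were made.

    Hypothetical structure; it holds only metadata, never object values.
    """

    # object id -> set of node ids that hold a copy
    object_locations: dict = field(default_factory=dict)
    # object id -> (task name, tuple of input object ids)
    lineage: dict = field(default_factory=dict)

    def record_task(self, output_id, task_name, input_ids):
        self.lineage[output_id] = (task_name, tuple(input_ids))

    def add_location(self, object_id, node_id):
        self.object_locations.setdefault(object_id, set()).add(node_id)

    def remove_node(self, node_id):
        """Forget a failed node; return the objects that now have no copy."""
        lost = []
        for object_id, nodes in self.object_locations.items():
            nodes.discard(node_id)
            if not nodes:
                lost.append(object_id)
        return lost


gcs = ControlStore()
gcs.record_task("obj2", "task2", ["obj1"])
gcs.add_location("obj2", "node-B")
gcs.add_location("obj1", "node-A")
gcs.add_location("obj1", "node-B")
lost = gcs.remove_node("node-B")  # only obj2 has no remaining copy
```

After `node-B` is removed, `obj1` still has a copy on `node-A`, so only `obj2` is reported as lost. Its lineage entry says which task and inputs would be needed to rebuild it.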
Scalability¶
- Decentralized scheduler
- Shared GCS
- Any worker can submit tasks
- Driver not a bottleneck
Fault Tolerance¶
Lineage based: Replay computation to reconstruct lost objects.
(1) Scenario
Specifically, suppose a job consists of four subtasks: 1, 2, 3, and 4.
Task 1 is assigned to the 1st worker, task 2 to the 2nd worker, task 3 to the 3rd worker, and task 4 to the 4th worker.
These workers are working "simultaneously".
If 2 fails, what should we do? (3 and 4 depend, directly or indirectly, on the result of 2.)
(2) How to solve
When the tasks are submitted, the dependencies between them are automatically recorded in the GCS: 1 -> 2, 2 -> 3, 3 -> 4.
When 2 fails, we can just ask the direct boss of 2, which is 3.
- We tell 3: "Your son 2 is dead, so you need to save him."
- 3 replies: "Copy! I will save him now."
- 3 tells 2: "2, you are dead, please redo your work!"
- 2 redoes the task.
- We are good now.
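The replay idea can be sketched in a few lines. The `LineageStore` class below is a hypothetical single-process toy, not Ray's implementation. It records which function and inputs produced each object, and when a lost object is requested it recomputes it, recursing upstream if the inputs were lost too.

```python
class LineageStore:
    """Toy lineage-based recovery: remember how each object was produced
    so that a lost object can be recomputed instead of replicated.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self._lineage = {}  # object id -> (function, input object ids)
        self._objects = {}  # object id -> value (the "in-memory store")
        self.replays = []   # ids recomputed during recovery, in order

    def run(self, object_id, func, *input_ids):
        """Execute a task and record its lineage."""
        self._lineage[object_id] = (func, input_ids)
        args = [self.get(i) for i in input_ids]
        self._objects[object_id] = func(*args)
        return object_id

    def lose(self, object_id):
        """Simulate a worker failure that drops one object."""
        self._objects.pop(object_id, None)

    def get(self, object_id):
        """Return the value, replaying lineage if it was lost."""
        if object_id not in self._objects:
            func, input_ids = self._lineage[object_id]
            args = [self.get(i) for i in input_ids]  # may recurse upstream
            self._objects[object_id] = func(*args)
            self.replays.append(object_id)
        return self._objects[object_id]


store = LineageStore()
store.run("1", lambda: 1)
store.run("2", lambda a: a + 1, "1")
store.run("3", lambda b: b * 10, "2")
store.run("4", lambda c: c - 5, "3")

# Worker 2 fails before its result reaches 3 and 4, so all three are gone.
for lost_id in ("2", "3", "4"):
    store.lose(lost_id)

value = store.get("4")  # replays 2, then 3, then 4 from the surviving 1
```

Only the lost part of the chain is recomputed. Object 1 survives, so it is reused rather than replayed.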
Ownership
The example above shows a "who is responsible for whom" relationship.
Ray calls this relationship "ownership".
Ephemeral resources¶
Summary¶
- Distributed (AI) apps are becoming the norm
- Building distributed apps is extremely hard
- Ray is a universal framework that dramatically simplifies distributed computing