Lecture 11 UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications
Why we need UPC++
Recall the SMP and HPC models from previous slides. They have drawbacks, for example the cost of caching remote data.
Hence a new model was introduced: the Partitioned Global Address Space (PGAS).
New programming languages and libraries were created for it: UPC, and later UPC++.
Its traits include:
- Communicate by reading/writing memory
- Easier to program
- Better to scale
- Never cache remote data
Some motivating system trends
The first exascale systems are being deployed now
- Cores per node is growing
- Cores are getting simpler (including GPU cores)
- Memory per core is dropping
- Latency is not improving
Need to reduce communication costs in software!
- Overlap communication to hide latency
- Reduce memory using smaller, more frequent messages
- Minimize software overhead
- Use simple messaging protocols (RDMA)
Reducing communication overhead
Our goal: each process directly access another’s memory via a global pointer
- With this approach, communication is one-sided.
- Like shared memory: shared data structures with asynchronous access.
UPC++: A Programming Library for PGAS
- UPC++ leverages C++ standards, needs only a standard C++ compiler.
- Relies on GASNet-EX for low-overhead communication
- Efficiently utilizes network hardware, including RDMA
- Enables portability
- Provides Active Messages on which UPC++ RPCs are built
- Designed for interoperability
- Same process model as MPI, enabling hybrid applications
- OpenMP and CUDA can be mixed with UPC++ as in MPI+X
RPCs
RPCs (Remote Procedure Calls) are an important communication feature of UPC++:
- They allow the caller to execute a user-defined function on a remote process, passing arguments and optionally returning a result to the sender
- All RPC operations are ==asynchronous==, with flexible synchronization mechanisms
How it works
An RPC involves both the initiating side and the target side:
- When an RPC is dispatched from the sender, the UPC++ runtime places a promise in the initiator's actQ
- GASNet-EX uses Active Messages (AM) to move the payload to the target side (low-level machinery, not covered here)
- On the target side, the incoming RPC (a lambda or function pointer plus its arguments) is inserted into the target's compQ
- After the RPC finishes executing, its return value is sent back to the initiator via an AM
Characteristics
- Makes data operations on remote processes simple
- Supports efficient distributed-memory parallel computing
- All communication operations are syntactically explicit, which helps programmers reason about the cost of communication and data movement
Basics of UPC++
Execution model: SPMD
UPC++ uses an SPMD model: a fixed number of processes run the same program
SPMD Model
SPMD (Single Program Multiple Data) is a parallel computing model.
SPMD lets multiple processors cooperatively execute the same program to obtain results faster, while each processor can work on different data.
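A minimal SPMD "hello world" in UPC++ might look like the following sketch (every process executes the same `main`, distinguished only by its rank):

```cpp
#include <upcxx/upcxx.hpp>
#include <iostream>

int main() {
    upcxx::init();   // start the UPC++ runtime; SPMD: all processes run this
    std::cout << "Hello from process " << upcxx::rank_me()
              << " out of " << upcxx::rank_n() << "\n";
    upcxx::barrier();    // wait for all processes before shutting down
    upcxx::finalize();
}
```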
Compiling and running a UPC++ program
Compiler wrapper:
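A typical invocation of the `upcxx` compiler wrapper (assuming it is on your `PATH`; the file name is illustrative):

```shell
# compile; the wrapper adds the UPC++ include paths and links GASNet-EX
upcxx -O hello.cpp -o hello
```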
Launch wrapper:
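And a typical launch with the `upcxx-run` wrapper (process count and binary name are illustrative):

```shell
# launch 4 processes of the SPMD program
upcxx-run -n 4 ./hello
```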
Global pointers
Global pointers are used to create logically shared but physically distributed data structures.
A global pointer is parameterized by the type of object it points to, e.g. `global_ptr<double>`.
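As a small sketch, a global pointer is obtained by allocating in the shared segment with `upcxx::new_`:

```cpp
// allocate a double in this rank's shared segment; other ranks can later
// access it through the returned global pointer
upcxx::global_ptr<double> gptr = upcxx::new_<double>(3.14);
```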
What does UPC++ offer
Overview: Asynchronous behavior
- RMA:
- Get/put to a remote location in another address space
- Low overhead, zero-copy, one-sided communication.
- RPC: Remote Procedure Call:
- Moves computation to the data
RMA just gives data to and fetches data from remote memory, while RPC is more powerful: it can run computation remotely (I am the boss, follow my orders).
Asynchronous communication (RMA)
Private vs. Shared Memory in UPC++
Broadcast in UPC++
Aim:
To write an interesting program, we need to have global pointers refer to remote data
One approach is to broadcast the pointer:
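A sketch of broadcasting a global pointer, following the pattern described above (the array size is illustrative):

```cpp
upcxx::global_ptr<double> data = nullptr;
if (upcxx::rank_me() == 0)
    data = upcxx::new_array<double>(1024);   // allocate in rank 0's shared segment
// every rank obtains rank 0's global pointer; broadcast returns a future
data = upcxx::broadcast(data, 0).wait();
```

After the broadcast, all ranks hold a global pointer referring to the same remote array.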
Remote access: Put / Get
We access shared memory on a remote rank with remote `get` / `put` (review get and put in MPI).
- Both are done via a global pointer to the remote variable
  - `rget` – read a variable on a remote rank
  - `rput` – write a variable on a remote rank
- But these operations take microseconds at best
- For most of that time, the processor is just waiting...
Since the processor spends most of that time just waiting, we naturally want asynchronous communication to put that gap to use.
Asynchronous remote operations
- Asynchronous execution used to hide remote latency
- Asynchronous get: start reading, but how to tell if you’re done?
- Put the results into a special "box" called a future
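A sketch of the asynchronous-get pattern just described (`gptr` is assumed to be a previously obtained `upcxx::global_ptr<double>`, and `do_other_useful_work` is a placeholder):

```cpp
upcxx::future<double> f = upcxx::rget(gptr);  // start the read; returns immediately
do_other_useful_work();                       // overlap computation with the transfer
double v = f.wait();                          // block only when the value is needed
```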
UPC++ Synchronization
UPC++ has two basic forms of barriers:
1) Synchronous barrier:
   - `barrier();` – block until all other processes arrive (the usual form); the same as a barrier in OpenMP
2) Asynchronous barriers:
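A sketch of the asynchronous barrier (`do_independent_work` is a placeholder for computation that does not depend on other ranks):

```cpp
upcxx::future<> f = upcxx::barrier_async();  // enter the barrier without blocking
do_independent_work();                       // overlap unrelated computation
f.wait();                                    // now block until everyone has arrived
```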
Synchronous vs. Asynchronous Barriers
An asynchronous barrier allows a thread to continue with unrelated computation while waiting for the other threads to reach the barrier point, in sharp contrast to a synchronous barrier.

| Property | Synchronous barrier | Asynchronous barrier |
|---|---|---|
| Waiting | Blocks immediately | Blocking can be deferred |
| Execution | Must wait for all threads | Can run other work meanwhile |
| Flexibility | Lower | Higher |
| Use case | Strict synchronization requirements | Overlapping computation |
Downcasting global pointers
If a process has affinity to an object referenced by a global pointer (i.e., the object is local), it can downcast the global pointer into a raw pointer with `local()`
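A sketch of the downcast, using the real `global_ptr` members `is_local()` and `local()` (`gptr` is assumed to be a previously obtained `upcxx::global_ptr<double>`):

```cpp
if (gptr.is_local()) {         // do we have affinity to the object?
    double *p = gptr.local();  // downcast to an ordinary raw pointer
    *p = 42.0;                 // plain local access, no communication
}
```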
After performing local computation through a global pointer, there are several main ways to synchronize the result back to all processes:
- For simple update operations:

```c
#pragma omp atomic
*global_ptr = local_result; // atomic update
```

- For complex operation sequences:

```c
#pragma omp critical
{
    // perform the local computation
    local_ptr = global_ptr.local();
    *local_ptr = new_value;
    // synchronize back to the global view
    *global_ptr = *local_ptr;
}
```
Using Atomics in UPC++
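A sketch of UPC++ remote atomics using the real `upcxx::atomic_domain` API (the shared-counter setup is illustrative):

```cpp
// an atomic_domain declares up front which operations it will perform
upcxx::atomic_domain<int64_t> ad({upcxx::atomic_op::load,
                                  upcxx::atomic_op::fetch_add});

upcxx::global_ptr<int64_t> counter = nullptr;
if (upcxx::rank_me() == 0) counter = upcxx::new_<int64_t>(0);
counter = upcxx::broadcast(counter, 0).wait();

// atomically increment the shared counter; the future resolves to the old value
int64_t prev = ad.fetch_add(counter, 1, std::memory_order_relaxed).wait();
```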
UPC++ Collectives
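Besides `broadcast`, UPC++ provides reductions; a sketch using the real `reduce_one` / `reduce_all` calls:

```cpp
// sum one value per rank down to rank 0
int64_t mine  = upcxx::rank_me();
int64_t total = upcxx::reduce_one(mine, upcxx::op_fast_add, 0).wait();
// or deliver the result to every rank
int64_t all   = upcxx::reduce_all(mine, upcxx::op_fast_add).wait();
```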
Remote procedure call (RPC)
Recall we mentioned that RPC is super powerful! (I am boss, I can tell you what to do.)
Futures in general
- A future holds a sequence of values and a state (ready / not ready).
- Waiting on the returned future lets user tailor degree of asynchrony.
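A sketch of working with futures, including combining several into one with `upcxx::when_all` (`gptr` is assumed to be a `upcxx::global_ptr<double>` obtained earlier):

```cpp
upcxx::future<int>    f1 = upcxx::rpc(0, []() { return 1; });
upcxx::future<double> f2 = upcxx::rget(gptr);
// when_all yields a future that is ready once all inputs are ready
upcxx::future<int, double> both = upcxx::when_all(f1, f2);
auto [i, d] = both.wait();   // wait() returns the held values as a tuple
```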
UPC++ has no implicit blocking: except for explicitly synchronizing operations like `wait`, operations are implicitly nonblocking (asynchronous).
Callbacks
Syntax: do something when a future becomes ready.
The idea: "when ... is done, use its result as arguments to do ..."
We can also chain callbacks to create a pipeline of operations
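A sketch of such a pipeline using `future::then` (again, `gptr` is an assumed `upcxx::global_ptr<double>`):

```cpp
// fetch a remote value, then transform it, then print; each stage runs
// only when the previous future becomes ready
upcxx::rget(gptr)
    .then([](double v) { return v * 2.0; })
    .then([](double doubled) { std::cout << doubled << "\n"; })
    .wait();
```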
Distributed objects
Distributed hash table (DHT)
The distributed analogue of `std::unordered_map`.
DHT data representation
A distributed object represents the directory of unordered maps.
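The representation can be sketched as below, following the structure described above (class and member names are illustrative; `upcxx::dist_object` is the real API):

```cpp
class DistrMap {
  // directory: one unordered_map per rank, reachable as a distributed object
  using dobj_map_t =
      upcxx::dist_object<std::unordered_map<std::string, std::string>>;
  dobj_map_t local_map{{}};
  // owner of a key, chosen by hashing the key over all ranks
  static int get_target_rank(const std::string &key) {
    return std::hash<std::string>{}(key) % upcxx::rank_n();
  }
public:
  // insert / find are discussed in the following sections
};
```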
DHT insertion
Insertion initiates an RPC to the owner and returns a future that represents completion of the insert
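A sketch of such an insert, assuming a directory class with a `dobj_map_t` distributed-object member `local_map` and a `get_target_rank` hash helper as described above (names illustrative):

```cpp
upcxx::future<> insert(const std::string &key, const std::string &val) {
  // ship (key, val) to the owning rank; the RPC body runs there
  return upcxx::rpc(get_target_rank(key),
      [](dobj_map_t &lmap, const std::string &key, const std::string &val) {
        (*lmap)[key] = val;   // update the owner's local map
      }, local_map, key, val);
}
```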
DHT find
The find function also uses RPC and returns a future.
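A matching find sketch under the same assumptions (a `dobj_map_t` member `local_map` and a `get_target_rank` helper; the "NOT FOUND" sentinel is illustrative):

```cpp
upcxx::future<std::string> find(const std::string &key) {
  return upcxx::rpc(get_target_rank(key),
      [](dobj_map_t &lmap, const std::string &key) -> std::string {
        auto it = lmap->find(key);
        return it == lmap->end() ? std::string("NOT FOUND") : it->second;
      }, local_map, key);
}
```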
Application Case Studies
Ignored here.