
Lecture 11 UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications


Why we need UPC++

Recall the shared-memory (SMP) and message-passing (HPC) models from previous lectures. Each has drawbacks, for example around caching remote data.

Hence a different model: the Partitioned Global Address Space (PGAS).

New programming languages and libraries were built around it: UPC and, later, UPC++.

Its key traits:

  1. Communicate by reading/writing memory
  2. Easier to program
  3. Better scaling
  4. Never caches remote data

Some motivating system trends

The first exascale systems are being deployed now

  • Cores per node is growing
  • Cores are getting simpler (including GPU cores)
  • Memory per core is dropping
  • Latency is not improving

Need to reduce communication costs in software!

  • Overlap communication to hide latency
  • Reduce memory using smaller, more frequent messages
  • Minimize software overhead
  • Use simple messaging protocols (RDMA)

Reducing communication overhead

Our goal: let each process directly access another's memory via a global pointer

  1. Communication is one-sided: there is no need to match a send with a receive on the remote side.
  2. Like shared memory: shared data structures with asynchronous access.


UPC++: A Programming Library for PGAS

  1. UPC++ leverages C++ standards, needs only a standard C++ compiler.
  2. Relies on GASNet-EX for low-overhead communication
    • Efficiently utilizes network hardware, including RDMA
    • Enables portability
    • Provides Active Messages on which UPC++ RPCs are built
  3. Designed for interoperability
    • Same process model as MPI, enabling hybrid applications
    • OpenMP and CUDA can be mixed with UPC++ as in MPI+X

RPCs

RPCs (Remote Procedure Calls) are an important communication feature of UPC++:

  • A caller can execute a user-defined function on a remote process, passing arguments and optionally returning a result to the sender
  • All RPC operations are asynchronous, with flexible synchronization mechanisms

How it works

An RPC involves two sides, the initiator and the target:

  • When an RPC is dispatched from the sender, the UPC++ runtime places a promise in the initiator's actQ
  • GASNet-EX uses Active Messages (AM) to move the payload to the target side (low-level work, not covered here)
  • On the target side, the incoming RPC (a lambda or function pointer plus its arguments) is inserted into the target's compQ
  • After the RPC finishes executing, its return value is sent back to the initiator via an AM

Properties

  • Makes it simple to operate on data that lives on a remote process
  • Supports efficient distributed-memory parallel computation
  • All communication is syntactically explicit, which helps programmers reason about the cost of communication and data movement

Basics of UPC++

Execution model: SPMD

UPC++ uses an SPMD model: a fixed number of processes run the same program

SPMD Model

SPMD (Single Program, Multiple Data) is a parallel programming model.

SPMD lets multiple processors cooperatively execute the same program to obtain a result faster, while each processor may operate on different data.

#include <upcxx/upcxx.hpp>
#include <iostream>
using namespace std;

int main() {
    upcxx::init(); // Set up UPC++ runtime
    cout << "Hello from " << upcxx::rank_me() << endl;

    upcxx::barrier();

    if (upcxx::rank_me() == 0)
        cout << "Done." << endl;

    upcxx::finalize(); // Close down UPC++ runtime
}


Compiling and running a UPC++ program

Compiler wrapper:

# how to compile
upcxx -g hello-world.cpp -o hello-world.exe

Launch wrapper:

# how to run with 4 processes
upcxx-run -np 4 ./hello-world.exe

Global pointers

Global pointers are used to create logically shared but physically distributed data structures.

Parameterized by the type of object it points to, e.g. global_ptr<double>.

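As a minimal sketch of allocating shared-segment storage and obtaining a global pointer (this assumes the UPC++ headers and runtime; upcxx::new_ and upcxx::delete_ manage objects in the shared segment):

```cpp
#include <upcxx/upcxx.hpp>
using namespace upcxx;

int main() {
    init();
    // Allocate a double in this process's shared segment; the
    // resulting global pointer can be communicated to other processes.
    global_ptr<double> gp = new_<double>(3.14);

    // A process with affinity to the object can downcast and use it directly.
    double *lp = gp.local();
    *lp += 1.0;

    delete_(gp); // free the shared-segment storage
    finalize();
}
```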

What does UPC++ offer

Overview: Asynchronous behavior

  • RMA:
    • Get/put to a remote location in another address space
    • Low overhead, zero-copy, one-sided communication.
  • RPC: Remote Procedure Call:
    • Moves computation to the data

RMA just puts data to and fetches data from remote memory, while RPC is more powerful because it also ships computation to the data (I am the boss, follow my orders).

Asynchronous communication (RMA)


Private vs. Shared Memory in UPC++


Broadcast in UPC++

Aim:

To write an interesting program, we need to have global pointers refer to remote data

One approach is to broadcast the pointer:

global_ptr<int> gptr = broadcast(new_<int>(24), 0).wait();

// new_<int>(24): allocate an int in the shared segment, initialized to 24
// broadcast(..., 0): broadcast rank 0's global pointer to all participating processes
// wait(): block until the broadcast completes
// global_ptr<int> gptr: store the result in a global pointer


Remote access: Put / Get

We access shared memory from a remote rank with remote get / put (review get and put in MPI)

  • Both done via a global pointer to the remote variable
    • rget – read a variable on a remote rank
    • rput – write a variable on a remote rank
  • But a remote access will take microseconds at best
    • For most of that time, the processor is just waiting...

Since the processor spends most of that time just waiting, asynchronous communication lets us use that gap for useful work.

Asynchronous remote operations

  1. Asynchronous execution used to hide remote latency
  2. Asynchronous get: start reading, but how to tell if you’re done?
    • Put the results into a special "box" called a future


future<int> fut_temp = rget(gptr);

// Still working on the rget operation.
// Don't waste time! ... Do something expensive.

int x = fut_temp.wait();

// Done! Now all ranks have x = 24.

UPC++ Synchronization

UPC++ has two basic forms of barriers:

  1. Synchronous barrier: barrier();
    • blocks until all other processes arrive (the usual kind)
    • the same idea as a barrier in OpenMP
  2. Asynchronous barrier:

future<> f = barrier_async(); // this process is ready for the barrier
// do computation unrelated to the barrier
f.wait(); // wait for the others to be ready

Synchronous vs. Asynchronous Barriers

An asynchronous barrier lets a process keep executing unrelated computation while waiting for the other processes to reach the barrier point, in sharp contrast to a synchronous barrier.

Property | Synchronous barrier | Asynchronous barrier
--- | --- | ---
Waiting | blocks immediately | blocking can be deferred
Execution | must wait for all processes | can run other tasks meanwhile
Flexibility | lower | higher
Best for | strict synchronization requirements | overlapping computation
// Asynchronous barrier example

// Suppose a parallel computation has two phases:
// phase 1 requires all processes to synchronize, phase 2 does not.

compute_phase1();              // phase 1 work, needs synchronization

future<> f = barrier_async();  // mark the synchronization point

// keep doing independent work
prepare_local_data();    // process local data
setup_next_phase();      // prepare the next phase of the computation

f.wait();  // wait for all processes to finish phase 1

compute_phase2();        // start the next phase

Downcasting global pointers

If a process has affinity to an object referenced by a global pointer (i.e., it is local), it can downcast the global pointer into a raw pointer with local()


After computing through a downcast raw pointer, the results still have to be published to other processes through UPC++ operations. Note that OpenMP pragmas such as atomic or critical only coordinate threads within a single process; they cannot synchronize processes on other nodes. Two common patterns:

  1. For simple updates: use UPC++ atomic operations on the global pointer (next section).
  2. For a sequence of operations: compute through the local pointer, then synchronize:
    C++
    // valid only on the process with affinity to the object
    int *local_ptr = gptr.local();
    *local_ptr = new_value;   // local computation
    upcxx::barrier();         // remote readers access the data only after this point
    

Using Atomics UPC++

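The slide's details are not reproduced here; as a hedged sketch of the UPC++ atomics API, atomic operations go through an atomic_domain that declares up front which operations will be used (assumes the UPC++ runtime):

```cpp
#include <upcxx/upcxx.hpp>
#include <iostream>
using namespace upcxx;

int main() {
    init();
    // Declare which atomic operations this domain will perform.
    atomic_domain<int64_t> ad({atomic_op::load, atomic_op::fetch_add});

    // Rank 0 allocates a shared counter; everyone gets the pointer.
    global_ptr<int64_t> counter;
    if (rank_me() == 0) counter = new_<int64_t>(0);
    counter = broadcast(counter, 0).wait();

    // Every process atomically increments the shared counter.
    ad.fetch_add(counter, 1, std::memory_order_relaxed).wait();
    barrier();

    if (rank_me() == 0) // final value equals rank_n()
        std::cout << ad.load(counter, std::memory_order_relaxed).wait() << std::endl;

    ad.destroy(); // collective teardown of the atomic domain
    finalize();
}
```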

UPC++ Collectives

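The collectives slide is not reproduced; as a sketch, UPC++ provides asynchronous collectives such as broadcast and reductions, all returning futures (assumes the UPC++ runtime):

```cpp
#include <upcxx/upcxx.hpp>

int main() {
    upcxx::init();
    int local = upcxx::rank_me() + 1;

    // Reduce across all processes; every rank receives the sum.
    int sum = upcxx::reduce_all(local, upcxx::op_fast_add).wait();

    // reduce_one delivers the result only to the named root rank (here, 0).
    int sum0 = upcxx::reduce_one(local, upcxx::op_fast_add, 0).wait();

    upcxx::barrier();
    upcxx::finalize();
}
```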

Remote procedure call (RPC)

Recall we mentioned that RPC is super powerful! (I am boss, I can tell you what to do.)

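As a minimal sketch of the RPC API (assumes the UPC++ runtime; the lambda executes on the target rank, and the return value comes back as a future):

```cpp
#include <upcxx/upcxx.hpp>
#include <iostream>

int main() {
    upcxx::init();
    int target = (upcxx::rank_me() + 1) % upcxx::rank_n();

    // Run a user-defined function on the target process;
    // the argument is shipped with the RPC.
    upcxx::future<int> f = upcxx::rpc(target,
        [](int x) { return x * 2; },  // executed remotely
        21);

    std::cout << "rank " << upcxx::rank_me()
              << " got " << f.wait() << std::endl; // each rank gets 42
    upcxx::finalize();
}
```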

Futures in general

  • A future holds a sequence of values and a state (ready / not ready).
  • Waiting on the returned future lets user tailor degree of asynchrony.
future<T> f1 = rget(gptr1); // asynchronous op
future<T> f2 = rget(gptr2);

bool ready = f1.ready(); // non-blocking poll (a probe)
if (!ready) {
    // ... Don't waste time!
    // unrelated work...
}

T t = f1.wait(); // waits if not ready

UPC++ has no implicit blocking: except for explicitly synchronizing operations like wait(), operations are implicitly nonblocking (asynchronous).

Callbacks


Idea: do something when a future becomes ready.

In words: "when ... is done, use its data as arguments to do ...".

We can also use chained callbacks to create a pipeline of operations.

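A sketch of chaining with .then() (gptr1 is assumed to be a valid global pointer and the UPC++ runtime initialized; each .then() returns a new future, so callbacks chain into a pipeline):

```cpp
// Attach callbacks that run when the rget completes.
upcxx::future<int> f =
    upcxx::rget(gptr1)
        .then([](int v) { return v + 1; })   // stage 1: runs when the data arrives
        .then([](int v) { return v * 2; });  // stage 2: consumes stage 1's result

int result = f.wait(); // (value + 1) * 2
```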

Distributed objects

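The slides are not reproduced; as a sketch, a dist_object gives every process its own instance of a value under one universal name, and fetch() retrieves another rank's instance (assumes the UPC++ runtime):

```cpp
#include <upcxx/upcxx.hpp>
#include <iostream>

int main() {
    upcxx::init();
    // Each rank holds its own int; together they form one distributed object.
    upcxx::dist_object<int> my_val(upcxx::rank_me() * 10);

    // fetch() asynchronously retrieves the instance held by another rank.
    int neighbor = (upcxx::rank_me() + 1) % upcxx::rank_n();
    int v = my_val.fetch(neighbor).wait();

    std::cout << "rank " << upcxx::rank_me() << " fetched " << v << std::endl;
    upcxx::barrier(); // keep my_val alive until all fetches complete
    upcxx::finalize();
}
```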

Distributed hash table (DHT)

Distributed analogy of std::unordered_map.


DHT data representation

A distributed object represents the directory of unordered maps.

class DistrMap {
    using dobj_map_t = dist_object<unordered_map<string, string>>;

    // Construct empty map
    dobj_map_t local_map{{}};

    // Map a key to the rank that owns it
    int get_target_rank(const string &key) {
        return std::hash<string>{}(key) % rank_n();
    }
};

DHT insertion

Insertion initiates an RPC to the owner and returns a future that represents completion of the insert

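The insertion code follows this pattern (a sketch based on the standard UPC++ DHT example; it assumes insert is a member of the DistrMap class above, so local_map and get_target_rank are in scope):

```cpp
// Insert: RPC to the owner rank; the future completes once the
// key/value pair has been stored in the owner's local map.
upcxx::future<> insert(const string &key, const string &val) {
    return upcxx::rpc(get_target_rank(key),
        // the lambda runs on the target; the dist_object argument is
        // resolved to the target's own instance of the map
        [](dobj_map_t &lmap, const string &key, const string &val) {
            lmap->insert({key, val});
        }, local_map, key, val);
}
```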

DHT find

The find function also uses RPC and returns a future.

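A matching sketch of find, again assumed to be a member of DistrMap (based on the standard UPC++ DHT example):

```cpp
// Find: RPC to the owner rank; the returned future holds the value,
// or "NOT FOUND" if the key is absent from the owner's map.
upcxx::future<string> find(const string &key) {
    return upcxx::rpc(get_target_rank(key),
        [](dobj_map_t &lmap, const string &key) -> string {
            auto elem = lmap->find(key);
            return elem == lmap->end() ? string("NOT FOUND") : elem->second;
        }, local_map, key);
}
```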

Application Case Studies

Omitted here.