Lecture 9: Distributed Memory Machines and Programming¶
Today we will talk about distributed memory machines and programming.
Please review the HPC model mentioned earlier.
We need to consider two dimensions of this model:
- Distributed Memory Architectures
- Communication via networks
- Network topology
- Performance model
- Programming HPC by Message Passing
- Overview of MPI
- How to use MPI
- Non-blocking communication
- Collectives
Distributed Memory Architectures¶
Network Analogies and Statements¶
Analogy
- Link
- Switch
- Distances (hops)
- Routing Algorithm
- Latency
- Bandwidth
- Topology
- Switching strategy
- Flow Control
Statement
- Latency is key for programs with many small messages
- Latency: delay between send and receive times
- Vendors often report hardware latencies (wire time)
- Application programmers care about software latencies (user program to user program)
- Diameter: the maximum, over all pairs of nodes, of the shortest path between a pair of nodes.
    - Compute the shortest path between every pair of nodes in the graph.
    - Among these shortest paths, find the longest one.
    - The length of that longest shortest path is the diameter of the graph; for example, a ring of \(p\) nodes has diameter \(\lfloor p/2 \rfloor\).
- Bisection Bandwidth: bandwidth across smallest cut that divides network into two equal halves
- Bandwidth across "narrowest" part of the network
- Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
- Represents the worst case for all-to-all communication
Bisection Bandwidth
Bisection bandwidth measures the worst-case performance of the network. Specifically:
The bisection bandwidth is the smallest bandwidth available between the two halves when the network is divided into two equal-sized parts. It is found by considering all possible equal splits and taking the one with the minimum bandwidth across the cut.
Bisection bandwidth is designed to give a lower-bound estimate of network performance:
- It represents the network's bandwidth capacity under the least favorable split
- It reflects the network's communication capability in the worst case
- It can be used to predict bottlenecks that may appear under heavy load
Note, however, that bisection bandwidth does not always accurately reflect actual network performance. In some cases, the actual network throughput may:
- Exceed the bisection-bandwidth estimate, because the actual mapping of processes to hosts may be better than the worst case
- Grow at a different rate than the bisection bandwidth, depending on the network topology
Performance Properties of a Network: Bisection Bandwidth
Although the green line (in the lecture figure) does divide the mesh into two equal numbers of nodes, it is not the cut that defines the bisection bandwidth, because:
Too many links cut
- The green cut crosses more links than necessary
- It has to sever more connections
- The bandwidth across it is larger than across the blue cut
Properties of the optimal cut
- It splits the network into two equal halves while cutting the fewest links (\(\sqrt{p}\) edges)
- It yields the minimum cut bandwidth
- This is why the bisection bandwidth of a \(\sqrt{p} \times \sqrt{p}\) mesh is \(\sqrt{p} \times\) the link bandwidth
So although the green line splits the network into two equal parts, it does not define the bisection bandwidth: the bisection bandwidth is determined by the equal split that cuts the fewest links.
Network Topology¶
Linear and Ring
Meshes and Tori
Hypercubes
Number of nodes \(n = 2^d\) for dimension d.
- Diameter = d.
- Bisection bandwidth = n/2.
Gray code addressing:
- Each node is connected to d others, each differing from it in exactly 1 bit.
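A small sketch (my own illustration, not from the lecture) of what one-bit-different addressing means in code: the d neighbors of node r are found by flipping each of its d address bits in turn.

```c
#include <stdio.h>

int main(void) {
    int d = 3;   /* dimension: n = 2^d = 8 nodes */
    int r = 5;   /* this node's address, 101 in binary */
    /* each neighbor differs from r in exactly one bit */
    for (int i = 0; i < d; i++)
        printf("neighbor across dimension %d: %d\n", i, r ^ (1 << i));
    return 0;
}
```

For r = 5 (101), this prints 4 (100), 7 (111), and 1 (001), each exactly one hop away.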
Trees
Butterflies
Dragonflies
- Combine in hierarchy
- Several groups are connected together using all-to-all links, i.e., each group has at least one link directly to each other group.
- The topology inside each group can be any topology.
- Uses a randomized routing algorithm
- Outcome: programmer can (usually) ignore topology, get good performance
- Important in virtualized, dynamic environment.
- Drawback: variable performance.
My View:
- Arbitrary topologies nested hierarchically; well suited to virtualized, dynamic network loads, but performance is not stable.
- Programmers need not care about the internal topology, only the interface.
- It uses a randomized routing algorithm rather than minimal (shortest-path) routing, which preserves load balance even under extreme traffic.
Why Random Routing¶
Minimal routing works well when traffic is load balanced, but is potentially catastrophic under adversarial traffic patterns.
Performance Models¶
Alpha-Beta Model¶
- Time to send n words: \(\text{Time} = \alpha + n\beta\), where \(\alpha\) is the propagation delay (per-message latency) and \(\beta\) is the transmission delay per word.
- One long message is cheaper than many short ones.
- A large computation-to-communication ratio is needed to be efficient.
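A quick worked example (illustrative numbers of my own, not from the lecture): suppose \(\alpha = 1\,\mu s\) and \(\beta = 1\,\mathrm{ns/byte}\). Sending ten 1 KB messages costs \(10\alpha + 10240\beta \approx 20.2\,\mu s\), while sending the same 10 KB as a single message costs \(\alpha + 10240\beta \approx 11.2\,\mu s\): aggregation nearly halves the time because the per-message \(\alpha\) is paid only once.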
Current Issue¶
- Processors are multi-core and many nodes are multi-chip.
- NIC bandwidth bottleneck is becoming more common.
Programming: Message Passing¶
Programming Distributed Memory Machines with Message Passing
Basic Line¶
Two important questions that arise early in a parallel program are:
- How many processes are participating in this computation?
- Which one am I?
```c
MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are participating? */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? */
```
Example:

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```
- All MPI programs begin with `MPI_Init` and end with `MPI_Finalize`.
- `MPI_COMM_WORLD` is defined by `mpi.h` (in C) and designates all processes in the MPI "job".
- Each statement executes independently in each process, including the printf/print statements.
- Use `mpirun -np 4 a.out` to run: `-np` specifies the number of processes, `4` is that number, and `a.out` is the executable.
Send and Recv¶
Some Basic Concepts
Keywords: process / group / message / context / communicator / WORLD / rank
- Processes can be collected into groups.
- Each message can be sent in a context, and must be received in the same context.
- A group and context together form a communicator.
- A process is identified by its rank in the group associated with a communicator.
- There is a default communicator whose group contains all initial processes, called `MPI_COMM_WORLD`.
MPI Datatypes
The data in a message to send or receive is described by a triple `(address, count, datatype)`, where an MPI datatype can be recursively defined. Complex datatypes may hurt performance.
```c
MPI_INT      /* predefined, corresponds to a C int */
MPI_DOUBLE   /* predefined, corresponds to a C double */
MPI_CHAR     /* predefined, corresponds to a C char */
```
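As a sketch of the "recursively defined" part (my own example, not from the lecture; `buf`, `dest`, and `tag` are assumed to be defined elsewhere): a derived datatype is built from existing ones, committed, and then used like any predefined type.

```c
/* Treat a row of 100 doubles as a single send unit. */
MPI_Datatype row;
MPI_Type_contiguous(100, MPI_DOUBLE, &row);
MPI_Type_commit(&row);
MPI_Send(buf, 1, row, dest, tag, MPI_COMM_WORLD);  /* sends one "row" */
MPI_Type_free(&row);
```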
MPI Tags
Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.
MPI Basic (Blocking) Send¶
```c
int MPI_Send(void *start, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
```
MPI_SEND Parameters
- `start`: starting address of the send buffer containing the data to be sent
- `count`: number of elements to send
- `datatype`: MPI datatype of each element (e.g., `MPI_INT`, `MPI_FLOAT`)
- `dest`: rank of the destination process in the communicator
- `tag`: integer message tag used to identify the message
- `comm`: communicator containing both the sending and receiving processes
comm
Of course, we need `comm` to be the same in the matching `MPI_SEND` and `MPI_RECV`.
- When the `MPI_SEND` function returns, the data has been delivered to the system and the buffer can be reused.
- The message may not yet have been received by the target process.
- What the return means:
    - Sender: "OK, I have sent the data. Whether you have received it yet is not my concern; I can now reuse this buffer."
    - Receiver: "Maybe I have received it already, maybe not."
MPI Basic (Blocking) Receive¶
```c
int MPI_Recv(void *start, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);
```
- Waits until a matching (both `source` and `tag`) message is received from the system; then the buffer can be used.
- `source` is a `rank` in the communicator specified by `comm`, or `MPI_ANY_SOURCE`.
- `tag` is a tag to be matched, or `MPI_ANY_TAG`.
- Receiving fewer than `count` occurrences of `datatype` is OK, but receiving more is an error.
    - Sender: "I sent 100 packets."
    - Receiver:
        - "I received 100 packets, we are all good!" (exact match)
        - "I received only 10 packets; there might be some packet loss." (fewer is OK)
        - "I received 101 packets? Are you kidding me?" (more is an error)
- `status` contains further information (e.g., the size of the message).
Example for Communicating¶
```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* process 0 sends, process 1 receives */
    if (rank == 0) {
        buf = 123456;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Received %d\n", buf);
    }

    MPI_Finalize();
    return 0;
}
```
One piece of code, but two roles:
- The same code runs in both the `sender` and `recv` processes.
- Different processes respond differently to this code because of the `if (rank == ...)` statement.
Retrieving Further Information¶
As mentioned above, `MPI_Status` is a data structure allocated in the user's program.
```c
MPI_Status status;
int recvd_tag, recvd_from, recvd_count;

MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
recvd_tag  = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count(&status, datatype, &recvd_count);
```
Most MPI applications can be written with only 6 functions (although which 6 may differ).
- Point-to-point communication:

    ```c
    MPI_INIT        // start line
    MPI_FINALIZE    // end line
    MPI_COMM_SIZE   // total number of processes
    MPI_COMM_RANK   // current rank (ID)
    MPI_SEND        // send data
    MPI_RECV        // receive data
    ```

- Collectives:

    ```c
    MPI_INIT        // start line
    MPI_FINALIZE    // end line
    MPI_COMM_SIZE   // total number of processes
    MPI_COMM_RANK   // current rank (ID)
    MPI_BCAST       // broadcast operation
    MPI_REDUCE      // reduce operation
    ```
Blocking and Non-blocking Communication¶
So far, all the communication functions we have introduced are `blocking`:
Unsafe Code
Point: if there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).
- Both processes first try to send data (Send) and only then receive (Recv); see the sketch below.
- If system buffering is insufficient, both processes block in their Send operations.
- Each side waits for the other to receive, but neither can reach its receive, so a deadlock forms.
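A minimal sketch of the unsafe pattern (my own illustration; `rank`, `sbuf`, `rbuf`, and `N` are assumed): whether it actually deadlocks depends on how much buffering the MPI implementation happens to provide.

```c
MPI_Status status;
/* Both ranks send first and receive second.  If neither message fits
   in system buffering, both MPI_Send calls block: deadlock. */
if (rank == 0) {
    MPI_Send(sbuf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
    MPI_Send(sbuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
}
```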
Solution 1
(1) Reorder the operations
In the first solution, deadlock is avoided by adjusting the order of the communication operations:
- Process 0: Send first, then Recv
- Process 1: Recv first, then Send
This works because:
- It guarantees that at least one process is able to receive a message
- It ==avoids both processes blocking in Send at the same time==
- It establishes a clear message-passing order
(2) Use a combined send/receive
The second solution uses the `MPI_Sendrecv` function (see the sketch after this list):
- Process 0: `Sendrecv(1)`
- Process 1: `Sendrecv(0)`
This works because:
- `Sendrecv` is an ==atomic operation== that handles the send and the receive together
- The internal implementation manages the buffering automatically
Solution 2
Use a send mode that changes the buffering requirements; we introduce the buffered (B) and ready (R) modes below.
Non-blocking Operations¶
A non-blocking operation returns immediately with a "request handle" that can later be used to test for or wait on completion. The key pieces:
Basic data types
- `MPI_Request`: stores the request information
- `MPI_Status`: stores the status information
Main functions
- `MPI_Isend`: non-blocking send
- `MPI_Irecv`: non-blocking receive
- `MPI_Wait`: wait for a request to complete
- `MPI_Test`: test whether a request has completed
Important notes
- Every request must eventually be completed with `MPI_Wait`
- `MPI_Test` can check for completion without waiting
- In particular: accessing the data buffer before the request completes is undefined behavior and can break the program
These operations are designed to improve parallel efficiency by allowing other computation to proceed while communication is in flight.
```c
MPI_Request request;
MPI_Status  status;

MPI_Isend(start, count, datatype, dest, tag, comm, &request);
MPI_Irecv(start, count, datatype, src,  tag, comm, &request);

MPI_Wait(&request, &status);
/* or test without blocking: */
MPI_Test(&request, &flag, &status);
```
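A usage sketch of the overlap this enables (my own example; `buf`, `n`, `src`, and `do_local_work` are assumptions, not from the lecture): post the receive early, compute, and wait only when the data is actually needed.

```c
MPI_Request req;
MPI_Status  status;
MPI_Irecv(buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);
do_local_work();           /* computation overlaps the incoming transfer */
MPI_Wait(&req, &status);   /* buf may be read only after this returns */
```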
Multiple Completions¶
Sometimes it's desirable to wait on multiple requests:
(1) MPI_Waitall: wait for all requests to complete
```c
MPI_Waitall(count, array_of_requests, array_of_statuses);
```
Need all requests to complete before continuing.
Like a waiter saying: "I will wait until all the customers have finished ordering, and only then start processing all the orders."
Return Value
- `MPI_SUCCESS`: all requests completed successfully
- `MPI_ERR_IN_STATUS`: one or more requests failed; in that case, the specific error status of each request is recorded in `array_of_statuses`
- Other error codes: the call failed for some other reason (e.g., invalid arguments)
(2) MPI_Waitany: wait for any request to complete
```c
MPI_Waitany(count, array_of_requests, &index, &status);
```
Wait for any request to complete, and return the index of the completed request.
Like a waiter saying: "Whoever finishes ordering first, I will handle their order first."
Return Value
- `MPI_SUCCESS`: a request completed successfully
- `index`: the position, within the array, of the request that completed
PS: if there are no active requests, `index` is set to `MPI_UNDEFINED`.
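A usage sketch (my own example, reusing the `reqs` array from the `MPI_Waitall` sketch above): service whichever request finishes first.

```c
int index;
MPI_Status status;
MPI_Waitany(4, reqs, &index, &status);   /* blocks until one request completes */
printf("request %d finished first\n", index);
```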
(3) MPI_Waitsome: wait for some requests to complete
```c
MPI_Waitsome(incount, array_of_requests, &outcount, array_of_indices, array_of_statuses);
```
If some requests are done, it returns the indices of those completed requests; the remaining ones can be handled later.
Like a waiter saying: "Let me see which customers have finished ordering, handle those orders together, and check on the rest later."
Return Value
- `MPI_SUCCESS`: one or more requests completed successfully
- `outcount`: the number of completed requests
- `array_of_indices`: the indices of the completed requests
- `MPI_ERR_IN_STATUS`: one or more operations failed
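A usage sketch (my own example, again reusing the `reqs` array from above): handle whatever has completed so far, then loop for the rest.

```c
int outcount, indices[4];
MPI_Status stats[4];
MPI_Waitsome(4, reqs, &outcount, indices, stats);  /* at least one completed */
for (int i = 0; i < outcount; i++)
    printf("request %d is done\n", indices[i]);
```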
There are corresponding test versions (`MPI_Testall`, `MPI_Testany`, `MPI_Testsome`) of each of these; we skip them here.
Communication Modes¶
For a Sender
- Blocking versions
    - Synchronous mode (`MPI_Ssend`): the "plain" mode
        - The send does not complete until a matching receive has begun.
        - Unsafe programs may deadlock.
    - Buffered mode (`MPI_Bsend`): the "big-spender" mode
        - The user supplies a buffer to the system for its use.
        - The user allocates enough memory to make an unsafe program safe (see the sketch below).
    - Ready mode (`MPI_Rsend`): uses the fast protocol
        - The user guarantees that a matching receive has already been posted.
        - Allows access to fast protocols.
        - Undefined behavior if the matching receive has not been posted.
- Non-blocking versions
    - `MPI_Issend`, etc.
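A minimal sketch of buffered mode (my own example; `data`, `n`, `dest`, and `tag` are assumed): the user attaches a buffer big enough for the pending sends, which is exactly what makes an unsafe exchange safe.

```c
int bufsize = n * sizeof(double) + MPI_BSEND_OVERHEAD;
char *tmp = malloc(bufsize);       /* user-supplied buffer space */
MPI_Buffer_attach(tmp, bufsize);   /* hand the buffer to MPI */
MPI_Bsend(data, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);  /* completes locally */
MPI_Buffer_detach(&tmp, &bufsize); /* blocks until buffered sends drain */
free(tmp);
```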
For a Receiver
`MPI_Recv` receives messages sent in any mode.