Lecture 9: Distributed Memory Machines and Programming¶
Today we will talk about distributed memory machines and programming.
Please review the HPC model mentioned earlier.
We need to consider two dimensions of this model:
- Distributed Memory Architectures
- Communication via networks
- Network topology
- Performance model
- Programming HPC by Message Passing
- Overview of MPI
- How to use MPI
- Non-blocking communication
- Collectives
Distributed Memory Architectures¶
Network Analogies and Statements¶
Analogy
- Link
- Switch
- Distances (hops)
- Routing Algorithm
- Latency
- Bandwidth
- Topology
- Switching strategy
- Flow Control
Statement
- Latency is key for programs with many small messages
- Latency: delay between send and receive times
- Vendors often report hardware latencies (wire time)
- Application programmers care about software latencies (user program to user program)
- Diameter: the maximum, over all pairs of nodes, of the shortest path between a pair of nodes.
    - Compute the shortest path between every pair of nodes in the graph.
    - Among these shortest paths, find the longest one.
    - The length of that longest shortest path is the diameter of the graph; for example, a ring of \(p\) nodes has diameter \(\lfloor p/2 \rfloor\).
- Bisection Bandwidth: bandwidth across smallest cut that divides network into two equal halves
- Bandwidth across "narrowest" part of the network
- Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
- Represents the worst case for all-to-all communication
Bisection Bandwidth
Bisection bandwidth measures the worst-case performance of the network. Specifically:
The bisection bandwidth is the smallest bandwidth available between the two halves when the network is divided into two equal-sized parts. It is found by considering all possible equal splits and taking the one with the minimum bandwidth across the cut.
Bisection bandwidth is designed to give a lower-bound estimate of network performance:
- It represents the network's bandwidth capacity under the least favorable split
- It reflects the network's communication capability in the worst case
- It can be used to predict bottlenecks that may appear under heavy load
Note, however, that bisection bandwidth does not always accurately reflect actual network performance. In some cases, the actual network throughput may:
- Exceed the bisection-bandwidth estimate, because the actual mapping of processes to hosts may be better than the worst case
- Grow at a different rate than the bisection bandwidth, depending on the network topology
Performance Properties of a Network: Bisection Bandwidth
Although the green line (in the lecture figure) does divide the mesh into two equal numbers of nodes, it is not the cut that defines the bisection bandwidth, because:
Too many links cut
- The green cut crosses more links than necessary
- It has to sever more connections
- The bandwidth across it is larger than across the blue cut
Properties of the optimal cut
- It splits the network into two equal halves while cutting the fewest links (\(\sqrt{p}\) edges)
- It yields the minimum cut bandwidth
- This is why the bisection bandwidth of a \(\sqrt{p} \times \sqrt{p}\) mesh is \(\sqrt{p} \times\) the link bandwidth
So although the green line splits the network into two equal parts, it does not define the bisection bandwidth: the bisection bandwidth is determined by the equal split that cuts the fewest links.
Network Topology¶
Linear and Ring
Meshes and Tori
Hypercubes
Number of nodes \(n = 2^d\) for dimension d.
- Diameter = d.
- Bisection bandwidth = n/2.
Gray code addressing:
- Each node is connected to d others, each differing from it in exactly 1 bit.
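A small sketch (my own illustration, not from the lecture) of what one-bit-different addressing means in code: the d neighbors of node r are found by flipping each of its d address bits in turn.

```c
#include <stdio.h>

int main(void) {
    int d = 3;   /* dimension: n = 2^d = 8 nodes */
    int r = 5;   /* this node's address, 101 in binary */
    /* each neighbor differs from r in exactly one bit */
    for (int i = 0; i < d; i++)
        printf("neighbor across dimension %d: %d\n", i, r ^ (1 << i));
    return 0;
}
```

For r = 5 (101), this prints 4 (100), 7 (111), and 1 (001), each exactly one hop away.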
Trees
Butterflies
Dragonflies
- Combine in hierarchy
- Several groups are connected together using all-to-all links, i.e., each group has at least one link directly to each other group.
- The topology inside each group can be any topology.
- Uses a randomized routing algorithm
- Outcome: programmer can (usually) ignore topology, get good performance
- Important in virtualized, dynamic environment.
- Drawback: variable performance.
My View:
- Arbitrary topologies nested hierarchically; well suited to virtualized, dynamic network loads, but performance is not stable.
- Programmers need not care about the internal topology, only the interface.
- It uses a randomized routing algorithm rather than minimal (shortest-path) routing, which preserves load balance even under extreme traffic.
Why Random Routing¶
Minimal routing works well when traffic is load balanced, but is potentially catastrophic under adversarial traffic patterns.
Performance Models¶
Alpha-Beta Model¶
- Time to send n words: \(\text{Time} = \alpha + n\beta\), where \(\alpha\) is the propagation delay (per-message latency) and \(\beta\) is the transmission delay per word.
- One long message is cheaper than many short ones.
- A large computation-to-communication ratio is needed to be efficient.
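A quick worked example (illustrative numbers of my own, not from the lecture): suppose \(\alpha = 1\,\mu s\) and \(\beta = 1\,\mathrm{ns/byte}\). Sending ten 1 KB messages costs \(10\alpha + 10240\beta \approx 20.2\,\mu s\), while sending the same 10 KB as a single message costs \(\alpha + 10240\beta \approx 11.2\,\mu s\): aggregation nearly halves the time because the per-message \(\alpha\) is paid only once.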
Current Issue¶
- Processors are multi-core and many nodes are multi-chip.
- NIC bandwidth bottleneck is becoming more common.
Programming: Message Passing¶
Programming Distributed Memory Machines with Message Passing
Basic Line¶
Two important questions that arise early in a parallel program are:
- How many processes are participating in this computation?
- Which one am I?
```c
MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are participating? */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? */
```
Example:

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```
- All MPI programs begin with `MPI_Init` and end with `MPI_Finalize`.
- `MPI_COMM_WORLD` is defined by `mpi.h` (in C) and designates all processes in the MPI "job".
- Each statement executes independently in each process, including the printf/print statements.
- Use `mpirun -np 4 a.out` to run: `-np` specifies the number of processes, `4` is that number, and `a.out` is the executable.
Send and Recv¶
Some Basic Concepts
Keywords: process / group / message / context / communicator / WORLD / rank
- Processes can be collected into groups.
- Each message can be sent in a context, and must be received in the same context.
- A group and context together form a communicator.
- A process is identified by its rank in the group associated with a communicator.
- There is a default communicator whose group contains all initial processes, called `MPI_COMM_WORLD`.
MPI Datatypes
The data in a message to send or receive is described by a triple `(address, count, datatype)`, where an MPI datatype can be recursively defined. Complex datatypes may hurt performance.
```c
MPI_INT      /* predefined, corresponds to a C int */
MPI_DOUBLE   /* predefined, corresponds to a C double */
MPI_CHAR     /* predefined, corresponds to a C char */
```
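As a sketch of the "recursively defined" part (my own example, not from the lecture; `buf`, `dest`, and `tag` are assumed to be defined elsewhere): a derived datatype is built from existing ones, committed, and then used like any predefined type.

```c
/* Treat a row of 100 doubles as a single send unit. */
MPI_Datatype row;
MPI_Type_contiguous(100, MPI_DOUBLE, &row);
MPI_Type_commit(&row);
MPI_Send(buf, 1, row, dest, tag, MPI_COMM_WORLD);  /* sends one "row" */
MPI_Type_free(&row);
```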
MPI Tags
Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.
MPI Basic (Blocking) Send¶
```c
int MPI_Send(void *start, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
```
MPI_SEND Parameters
- `start`: starting address of the send buffer containing the data to be sent
- `count`: number of elements to send
- `datatype`: MPI datatype of each element (e.g., `MPI_INT`, `MPI_FLOAT`)
- `dest`: rank of the destination process in the communicator
- `tag`: integer message tag used to identify the message
- `comm`: communicator containing both the sending and receiving processes
comm
Of course, we need `comm` to be the same in the matching `MPI_SEND` and `MPI_RECV`.
- When the `MPI_SEND` function returns, the data has been delivered to the system and the buffer can be reused.
- The message may not yet have been received by the target process.
- What the return means:
    - Sender: "OK, I have sent the data. Whether you have received it yet is not my concern; I can now reuse this buffer."
    - Receiver: "Maybe I have received it already, maybe not."
MPI Basic (Blocking) Receive¶
```c
int MPI_Recv(void *start, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);
```
- Waits until a matching (both `source` and `tag`) message is received from the system; then the buffer can be used.
- `source` is a `rank` in the communicator specified by `comm`, or `MPI_ANY_SOURCE`.
- `tag` is a tag to be matched, or `MPI_ANY_TAG`.
- Receiving fewer than `count` occurrences of `datatype` is OK, but receiving more is an error.
    - Sender: "I sent 100 packets."
    - Receiver:
        - "I received 100 packets, we are all good!" (exact match)
        - "I received only 10 packets; there might be some packet loss." (fewer is OK)
        - "I received 101 packets? Are you kidding me?" (more is an error)
- `status` contains further information (e.g., the size of the message).
Example for Communicating¶
```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* process 0 sends, process 1 receives */
    if (rank == 0) {
        buf = 123456;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Received %d\n", buf);
    }

    MPI_Finalize();
    return 0;
}
```
One piece of code, but two roles:
- The same code runs in both the `sender` and `recv` processes.
- Different processes respond differently to this code because of the `if (rank == ...)` statement.
Retrieving Further Information¶
As mentioned above, `MPI_Status` is a data structure allocated in the user's program.
```c
MPI_Status status;
int recvd_tag, recvd_from, recvd_count;

MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
recvd_tag  = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count(&status, datatype, &recvd_count);
```
Most MPI applications can be written with only 6 functions (although which 6 may differ).
- Point-to-point communication:

    ```c
    MPI_INIT        // start line
    MPI_FINALIZE    // end line
    MPI_COMM_SIZE   // total number of processes
    MPI_COMM_RANK   // current rank (ID)
    MPI_SEND        // send data
    MPI_RECV        // receive data
    ```

- Collectives:

    ```c
    MPI_INIT        // start line
    MPI_FINALIZE    // end line
    MPI_COMM_SIZE   // total number of processes
    MPI_COMM_RANK   // current rank (ID)
    MPI_BCAST       // broadcast operation
    MPI_REDUCE      // reduce operation
    ```
Blocking and Non-blocking Communication¶
So far, all the communication functions we have introduced are `blocking`:
Unsafe Code
Point: if there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).
- Both processes first try to send data (Send) and only then receive (Recv); see the sketch below.
- If system buffering is insufficient, both processes block in their Send operations.
- Each side waits for the other to receive, but neither can reach its receive, so a deadlock forms.
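A minimal sketch of the unsafe pattern (my own illustration; `rank`, `sbuf`, `rbuf`, and `N` are assumed): whether it actually deadlocks depends on how much buffering the MPI implementation happens to provide.

```c
MPI_Status status;
/* Both ranks send first and receive second.  If neither message fits
   in system buffering, both MPI_Send calls block: deadlock. */
if (rank == 0) {
    MPI_Send(sbuf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
    MPI_Send(sbuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
}
```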
Solution 1
(1) Reorder the operations
In the first solution, deadlock is avoided by adjusting the order of the communication operations:
- Process 0: Send first, then Recv
- Process 1: Recv first, then Send
This works because:
- It guarantees that at least one process is able to receive a message
- It ==avoids both processes blocking in Send at the same time==
- It establishes a clear message-passing order
(2) Use a combined send/receive
The second solution uses the `MPI_Sendrecv` function (see the sketch after this list):
- Process 0: `Sendrecv(1)`
- Process 1: `Sendrecv(0)`
This works because:
- `Sendrecv` is an ==atomic operation== that handles the send and the receive together
- The internal implementation manages the buffering automatically
Solution 2
Use a send mode that changes the buffering requirements; we introduce the buffered (B) and ready (R) modes below.
Non-blocking Operations¶
A non-blocking operation returns immediately with a "request handle" that can later be used to test for or wait on completion. The key pieces:
Basic data types
- `MPI_Request`: stores the request information
- `MPI_Status`: stores the status information
Main functions
- `MPI_Isend`: non-blocking send
- `MPI_Irecv`: non-blocking receive
- `MPI_Wait`: wait for a request to complete
- `MPI_Test`: test whether a request has completed
Important notes
- Every request must eventually be completed with `MPI_Wait`
- `MPI_Test` can check for completion without waiting
- In particular: accessing the data buffer before the request completes is undefined behavior and can break the program
These operations are designed to improve parallel efficiency by allowing other computation to proceed while communication is in flight.
```c
MPI_Request request;
MPI_Status  status;

MPI_Isend(start, count, datatype, dest, tag, comm, &request);
MPI_Irecv(start, count, datatype, src,  tag, comm, &request);

MPI_Wait(&request, &status);
/* or test without blocking: */
MPI_Test(&request, &flag, &status);
```
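A usage sketch of the overlap this enables (my own example; `buf`, `n`, `src`, and `do_local_work` are assumptions, not from the lecture): post the receive early, compute, and wait only when the data is actually needed.

```c
MPI_Request req;
MPI_Status  status;
MPI_Irecv(buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);
do_local_work();           /* computation overlaps the incoming transfer */
MPI_Wait(&req, &status);   /* buf may be read only after this returns */
```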
Multiple Completions¶
Sometimes it's desirable to wait on multiple requests:
(1) MPI_Waitall: wait for all requests to complete
```c
MPI_Waitall(count, array_of_requests, array_of_statuses);
```
Need all requests to complete before continuing.
Like a waiter saying: "I will wait until all the customers have finished ordering, and only then start processing all the orders."
Return Value
- `MPI_SUCCESS`: all requests completed successfully
- `MPI_ERR_IN_STATUS`: one or more requests failed; in that case, the specific error status of each request is recorded in `array_of_statuses`
- Other error codes: the call failed for some other reason (e.g., invalid arguments)
(2) MPI_Waitany: wait for any request to complete
```c
MPI_Waitany(count, array_of_requests, &index, &status);
```
Wait for any request to complete, and return the index of the completed request.
Like a waiter saying: "Whoever finishes ordering first, I will handle their order first."
Return Value
- `MPI_SUCCESS`: a request completed successfully
- `index`: the position, within the array, of the request that completed
PS: if there are no active requests, `index` is set to `MPI_UNDEFINED`.
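A usage sketch (my own example, reusing the `reqs` array from the `MPI_Waitall` sketch above): service whichever request finishes first.

```c
int index;
MPI_Status status;
MPI_Waitany(4, reqs, &index, &status);   /* blocks until one request completes */
printf("request %d finished first\n", index);
```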
(3) MPI_Waitsome: wait for some requests to complete
```c
MPI_Waitsome(incount, array_of_requests, &outcount, array_of_indices, array_of_statuses);
```
If some requests are done, it returns the indices of those completed requests; the remaining ones can be handled later.
Like a waiter saying: "Let me see which customers have finished ordering, handle those orders together, and check on the rest later."
Return Value
- `MPI_SUCCESS`: one or more requests completed successfully
- `outcount`: the number of completed requests
- `array_of_indices`: the indices of the completed requests
- `MPI_ERR_IN_STATUS`: one or more operations failed
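A usage sketch (my own example, again reusing the `reqs` array from above): handle whatever has completed so far, then loop for the rest.

```c
int outcount, indices[4];
MPI_Status stats[4];
MPI_Waitsome(4, reqs, &outcount, indices, stats);  /* at least one completed */
for (int i = 0; i < outcount; i++)
    printf("request %d is done\n", indices[i]);
```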
There are corresponding test versions (`MPI_Testall`, `MPI_Testany`, `MPI_Testsome`) of each of these; we skip them here.
Communication Modes¶
For a Sender
- Blocking versions
    - Synchronous mode (`MPI_Ssend`): the "plain" mode
        - The send does not complete until a matching receive has begun.
        - Unsafe programs may deadlock.
    - Buffered mode (`MPI_Bsend`): the "big-spender" mode
        - The user supplies a buffer to the system for its use.
        - The user allocates enough memory to make an unsafe program safe (see the sketch below).
    - Ready mode (`MPI_Rsend`): uses the fast protocol
        - The user guarantees that a matching receive has already been posted.
        - Allows access to fast protocols.
        - Undefined behavior if the matching receive has not been posted.
- Non-blocking versions
    - `MPI_Issend`, etc.
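A minimal sketch of buffered mode (my own example; `data`, `n`, `dest`, and `tag` are assumed): the user attaches a buffer big enough for the pending sends, which is exactly what makes an unsafe exchange safe.

```c
int bufsize = n * sizeof(double) + MPI_BSEND_OVERHEAD;
char *tmp = malloc(bufsize);       /* user-supplied buffer space */
MPI_Buffer_attach(tmp, bufsize);   /* hand the buffer to MPI */
MPI_Bsend(data, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);  /* completes locally */
MPI_Buffer_detach(&tmp, &bufsize); /* blocks until buffered sends drain */
free(tmp);
```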
For a Receiver
`MPI_Recv` receives messages sent in any mode.