Lecture 4 Shared Memory Programming: Mostly OpenMP¶
Overview¶
As parallel computing has developed, the dominant machine models and their corresponding programming languages have kept changing:
- Vector Machines:
  - IVDEP
- SIMD Machines:
  - Data parallel languages: SIMD ...
- Shared Memory Machines (SMPs):
  - Shared memory programming: OpenMP ...
- Clusters (HPC model):
  - Message passing (MPI) became dominant
- Additional trends:
  - Accelerators: OpenACC, CUDA
  - Cloud computing: Hadoop, Spark
We will come back to each of these topics later in the course.
Recall: Shared Memory Model
- A program is a set of threads of control.
- Each thread has a set of private variables.
  - e.g. local stack variables
- Each thread also has a set of shared (public) variables.
  - e.g. the global heap, `static` variables
Two important points:
- Threads communicate implicitly by writing and reading shared variables.
- Threads coordinate by synchronizing on shared variables.
Parallel Programming with Threads¶
POSIX Threads
- POSIX: Portable Operating System Interface, a standard interface to operating system utilities.
- PThreads: the POSIX threading interface, i.e. the system calls used to create and synchronize threads.
Here we introduce the POSIX interface, focusing mainly on PThreads.
PThreads contain support for:
- Creating parallelism
- Synchronizing
- No explicit support for communication
  - because shared memory is implicit; a pointer to shared data is passed to a thread
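The original listing here did not survive extraction; the standard `pthread_create` prototype, with its parameters named as in the notes below, looks like this:

```c
#include <pthread.h>

int pthread_create(pthread_t *thread_id,
                   const pthread_attr_t *thread_attribute,
                   void *(*thread_fun)(void *),
                   void *fun_arg);
/* the returned int is the errorcode discussed below */
```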
- `thread_id` is the thread id or handle (used to halt the thread, etc.)
- `thread_attribute` refers to various attributes
  - Default value: a `NULL` pointer
  - Sample attributes: minimum stack size, priority
- `thread_fun`: the function to be run
- `fun_arg`: an argument that can be passed to `thread_fun` when it starts
- `errorcode` (the return value) will be set nonzero if the create operation fails (recall `nil` in Golang)
Examples:
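The original example listing is also missing; a minimal sketch of the usual pattern (the names `hello` and `NTHREADS` are illustrative), which can be compiled as described below:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void *hello(void *arg) {
    long id = (long)arg;                 /* fun_arg passed to thread_fun      */
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) {
        int errorcode = pthread_create(&threads[i], NULL, hello, (void *)i);
        if (errorcode != 0)              /* nonzero means the create failed   */
            return 1;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);  /* wait for each thread to finish    */
    return 0;
}
```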
Compile using `gcc -lpthread`
Recall: Race Condition
A race condition or data race occurs when:
- Two processors (or two threads) access the same variable, and at least one does a write.
- The accesses are concurrent (not synchronized) so they could happen simultaneously.
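As a concrete illustration (a minimal sketch, not from the lecture), two threads incrementing a shared counter with no synchronization usually lose updates:

```c
#include <pthread.h>
#include <stdio.h>

#define N_ITER 1000000

long count = 0;                 /* shared variable */

void *worker(void *arg) {
    for (long i = 0; i < N_ITER; i++)
        count++;                /* unsynchronized read-modify-write: a data race */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* the final value is often less than the expected total */
    printf("count = %ld (expected %d)\n", count, 2 * N_ITER);
    return 0;
}
```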
Basic Types of Synchronization: Mutexes
Mutexes, a.k.a. Locks (mutual-exclusion locks)
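The code block here was lost; a small sketch of the usual lock discipline, protecting the racy counter from above with a Pthreads mutex (names are illustrative):

```c
#include <pthread.h>

long count = 0;                                          /* shared data */
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (long i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&count_lock);    /* enter the critical section   */
        count++;                            /* the update can no longer race */
        pthread_mutex_unlock(&count_lock);  /* leave the critical section   */
    }
    return NULL;
}
```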
Locks only affect the processors currently using them.
Materials concerning 🔒
Lock Mechanism
Key Properties:
- A lock only prevents access from other processors that explicitly try to acquire that specific lock
- Simply holding a lock does not automatically prevent other threads from accessing the protected data
- All code that accesses shared data must use the same lock for protection to be effective
Critical Implications
- Protection Requirements:
  - Every access to shared data must be guarded by acquiring the appropriate lock first
  - If a thread accesses shared data without acquiring the necessary lock, it can still modify the data regardless of whether other threads hold locks
- Implementation Details:
  - Modern processors use cache coherency protocols (like MESI) to implement atomic operations
  - The LOCK# signal was historically used to prevent other processors from accessing memory during critical operations
  - Hardware memory barriers ensure that lock operations are properly synchronized across multiple processors
Therefore, a good habit is: Always acquire the correct lock before accessing shared resources
There is a related concept called the Semaphore, which differs from a Mutex in a few ways:
- Semaphores generalize locks to allow up to k threads simultaneous access to a resource.
- A Mutex can only be unlocked by its owner (the thread that acquired it).
- A Semaphore can be decremented by any process that has access to the semaphore.
Syntax of Mutex
To create a mutex:
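(The original snippet is missing; these are the standard Pthreads calls, with `amutex` as an example name.)

```c
pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;   /* static initialization      */
/* or, at run time: */
pthread_mutex_init(&amutex, NULL);                    /* NULL = default attributes  */
```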
To use it:
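(Again a reconstruction of the standard calls:)

```c
pthread_mutex_lock(&amutex);      /* blocks until the mutex is available */
/* ... critical section: access the shared data ... */
pthread_mutex_unlock(&amutex);
```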
To deallocate a mutex:
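(Standard call, reconstructed:)

```c
pthread_mutex_destroy(&amutex);
```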
Multiple mutexes may be held, but can lead to problems:
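(A reconstruction of the classic two-lock interleaving; read the left column as Thread 1 and the right column as Thread 2:)

```c
/* Thread 1 */                   /* Thread 2 */
pthread_mutex_lock(&a);          pthread_mutex_lock(&b);
pthread_mutex_lock(&b);          pthread_mutex_lock(&a);
```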
Deadlock results if both threads acquire one of their locks, so that neither can acquire the second
Why Deadlock Happens Here
Timeline:
- Thread1 acquires lock a
- Thread2 acquires lock b
- Thread1 tries to acquire lock b (held by Thread2) and starts waiting
- Thread2 tries to acquire lock a (held by Thread1) and starts waiting
OpenMP¶
OpenMP = Open specification for Multi-Processing
This picture is all you need when programming:
Architecture:
Based on this architecture, you can just use the OpenMP syntax without worrying about the underlying layers of the implementation.
Overview (highly abstracted)
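The small listings that belonged here are missing; a schematic sketch of the core pieces (the compiler flag and the placeholder work are illustrative):

```c
#include <omp.h>                   /* OpenMP runtime library header                   */
/* compile with OpenMP enabled, e.g.  gcc -fopenmp prog.c                             */

void sketch(void) {
    #pragma omp parallel           /* the following block runs on a team of threads   */
    {
        int id = omp_get_thread_num();   /* runtime routine: which thread am I?       */
        /* ... work for thread id ... */
    }
}                                  /* the threads join again at the end of the region */
```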
Example:
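The full listing is missing; a minimal "hello world" along the same lines:

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel               /* each thread executes this block */
    {
        int id = omp_get_thread_num();
        printf("hello(%d)", id);
        printf(" world(%d)\n", id);
    }
    return 0;
}
```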
Fork-Join Model:
Thread creation: Parallel regions¶
How to assign thread numbers
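(Reconstructed; both forms are standard OpenMP:)

```c
omp_set_num_threads(4);           /* request 4 threads for the next parallel region */
/* or, per region: */
#pragma omp parallel num_threads(4)
```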
How to get current thread ID
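(Reconstructed:)

```c
int id = omp_get_thread_num();    /* my id within the team: 0 .. nthreads-1 */
```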
How to get total number of threads
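(Reconstructed:)

```c
int nthreads = omp_get_num_threads();   /* size of the current team (1 outside a parallel region) */
```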
Each thread executes a copy of the code within the structured block.
What is the real number of threads?
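(Reconstructed sketch; the requested count of 8 is illustrative:)

```c
omp_set_num_threads(8);                     /* what we request ...                      */
#pragma omp parallel
{
    int nthreads = omp_get_num_threads();   /* ... versus the team size we actually got */
}
```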
But is the number of threads requested the number you actually get?
- NO! An implementation can silently decide to give you a team with fewer threads.
- Once a team of threads is established, the system will not reduce the size of the team.
Time Calculation¶
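(Reconstructed prototype:)

```c
double omp_get_wtime(void);   /* wall-clock time in seconds, from an arbitrary but fixed origin */
```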
The time origin of `omp_get_wtime()` is arbitrary, but it stays fixed within one program run, so the difference between two calls gives the elapsed wall-clock time.
So we usually compute a duration like this:
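(A sketch of the usual pattern:)

```c
double start = omp_get_wtime();
/* ... the code being timed ... */
double end = omp_get_wtime();
printf("elapsed time: %f seconds\n", end - start);
```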
Shared Memory Hardware and Memory Consistency¶
Cache Coherence
- We use cache to reduce memory access time
- We use a write-back policy, which means a modified value is written back to memory only when its cache line is evicted (or explicitly flushed), not on every store.
- In this scenario it can easily happen that one thread updates a value in its own cache (and eventually memory) while another thread keeps reading a stale copy from its own cache.
- Keeping all caches in agreement is the cache coherence problem.
We skip the detailed protocols here since they are beyond the scope of this class.
False Sharing
If independent data elements happen to sit on the same cache line, each update will cause the cache lines to "slosh back and forth" between threads ... This is called "false sharing".
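A hedged sketch of the typical situation (not from the lecture; `NUM_THREADS`, `N` and `compute()` are illustrative):

```c
double partial[NUM_THREADS];           /* neighbouring elements share a cache line        */

#pragma omp parallel
{
    int id = omp_get_thread_num();
    partial[id] = 0.0;
    for (long i = 0; i < N; i++)
        partial[id] += compute(i);     /* each update invalidates the neighbours' copies  */
}
/* Common fixes: pad each slot out to a full cache line, or accumulate in a local
   variable and write partial[id] only once at the end. */
```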
Synchronization¶
High-level synchronization constructs are included in the common core (the full OpenMP specification has MANY more). Here we just focus on:
- critical
- barrier
Synchronization: critical¶
Mutual exclusion: Only one thread at a time can enter a critical region
Threads wait their turn – only one at a time calls consume()
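The original listing is missing; a sketch of the classic pattern (`big_job()`, `NITERS` and the arrays are placeholders; `consume()` is the call mentioned below):

```c
float res = 0.0;
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    for (int i = id; i < NITERS; i += nthrds) {
        float B = big_job(i);          /* independent work, done in parallel      */
        #pragma omp critical
        res += consume(B);             /* only one thread at a time calls consume */
    }
}
```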
e.g. Suppose 4 threads (A / B / C / D) are working on the loop above, with speeds A > B > C = D.
A reaches the critical section first and adds its result, then B.
Since C and D have the same speed they arrive together, and `critical` blocks one of them (only one thread at a time may enter).
The `critical` construct acts as an arbiter: say it lets C go first, and then D.
Synchronization: barrier¶
A "Stand-Alone" Pragma
No thread can pass the barrier line until all the threads have reached this point.
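The listing is missing; a sketch of the usual two-phase pattern (`big_calc1`, `big_calc2` and the arrays are placeholders):

```c
#pragma omp parallel
{
    int id = omp_get_thread_num();
    A[id] = big_calc1(id);        /* phase 1                                       */
    #pragma omp barrier           /* no thread continues until all finish phase 1  */
    B[id] = big_calc2(id, A);     /* phase 2 may safely read every element of A    */
}
```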
e.g. Again suppose 4 threads (A / B / C / D) with speeds A > B > C = D.
When A reaches the barrier it has to wait; the same happens to B and to C. When D arrives, all threads have reached the line, so they can all go through.
Now all the threads are at the same point.
Loop Optimization¶
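The listing is missing; a sketch of a work-shared loop (arrays `a`, `b` and bound `N` are illustrative):

```c
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        a[i] = a[i] + b[i];       /* iterations are divided among the threads */
    }
}
```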
Tips:
- All threads work on this loop together; different values of the index `i` are assigned to different threads.
- The loop control index `i` is made "private" to each thread by default.
- Threads wait at the end of the `parallel` region until all threads are finished with the parallel loop before any proceed past the end of the loop.
You can also combine 2 pragmas into 1 for simplification:
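(Reconstructed sketch of the combined form:)

```c
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
}
```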
PS: At the end of a `#pragma omp parallel` region there is an implicit barrier.
This guarantees that:
- All threads complete their assigned loop iterations
- No thread can prematurely exit the parallel region
- Data consistency and correct execution order are maintained
Schedule Clause¶
Static Scheduling:
- Schedule determined at compile time
- Allocates fixed-size chunks of iterations to threads
- Best for:
- Predictable workload per iteration
- Predetermined work distribution
- Minimal runtime overhead since scheduling is done at compile time
Dynamic Scheduling:
- Schedule determined at runtime
- Threads grab chunks of iterations from a queue dynamically
- Best for:
- Unpredictable workload per iteration
- Highly variable work distribution
- Complex scheduling logic handled during execution
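A sketch of the difference (assuming arrays `a`, `b` and a function `work(i)` whose cost varies per iteration; these names are illustrative, not from the lecture):

```c
/* Uniform iterations: a static schedule with fixed chunks has minimal overhead.  */
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
    a[i] = i * i;

/* Irregular iterations: threads grab chunks of 4 iterations at run time.         */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    b[i] = work(i);
```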
Region Division in Loop¶
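(Reconstructed sketch; `c`, `work`, `N`, `M` are illustrative:)

```c
#pragma omp parallel for collapse(2)   /* the i and j loops become one iteration space */
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        c[i][j] = work(i, j);
```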
The `collapse(number)` clause fuses the outermost `number` loops into a single iteration space, potentially providing better load balancing and parallel efficiency across the available threads.
But it is a double-edged sword.
Advantages
Increased Parallelism:
- Enlarges iteration space, providing more parallel opportunities
- Improves load balancing across threads
- Particularly effective when outer loops are small and inner loops are large
Limitations
- Code structure constraints:
  - No code can exist between the collapsed loops
- Excessive collapsing may:
  - Inhibit vectorization optimizations
  - Increase scheduling overhead
  - Reduce cache efficiency
The optimal collapse value should be determined through performance testing and careful consideration of the specific application requirements and hardware architecture.
How to choose the number
One piece of advice: starting from 1, try different values and see which one works best :)
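The listing is missing; a sketch consistent with the explanation below (`results` and `heavy_computation` are placeholders):

```c
#pragma omp parallel for collapse(2) schedule(dynamic)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        results[i][j] = heavy_computation(i, j);   /* per-iteration cost varies */
    }
}
```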
- `collapse(2)`: combine the outer 2 loops into one iteration space.
- `schedule(dynamic)`: use dynamic assignment of iterations to threads.
- `#pragma omp parallel for`: let the for-loop be executed by multiple threads.
Reduction¶
A problem we cannot yet handle is the following:
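The listing is missing; a sketch of the classic averaging loop (consistent with the `ave` variable discussed below; `MAX` is a placeholder):

```c
double ave = 0.0, A[MAX];
int i;
for (i = 0; i < MAX; i++) {
    ave += A[i];          /* every iteration reads and writes the same accumulator */
}
ave = ave / MAX;
```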
There is a true dependence between loop iterations that can’t be trivially removed:
- We need to combine values into a single accumulative variable.
- If every thread updates the same variable, we need to ensure that the updates are done in a thread-safe manner.
Hence we need reduction
!
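(Reconstructed clause syntax and an example use:)

```c
/* clause syntax: reduction(op : list) */
#pragma omp parallel for reduction(+ : ave)
```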
Inside a parallel or a work-sharing construct:
- A local copy of each list variable is made and initialized depending on the "op" (e.g. 0 for "+").
- Updates occur on the local copy.
- Local copies are reduced into a single value and combined with the original global value.
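(The same averaging loop, now with a reduction; reconstructed sketch:)

```c
double ave = 0.0, A[MAX];
int i;
#pragma omp parallel for reduction(+ : ave)
for (i = 0; i < MAX; i++) {
    ave += A[i];          /* each thread accumulates into its own private copy of ave */
}
ave = ave / MAX;          /* by this point the private copies have been combined      */
```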
Basic syntax of `reduction(op:list)`:
- `reduction(+:ave)` performs an addition reduction on the variable `ave`.
- Each thread creates a private copy of `ave`.
- At the end of the parallel region the results of all threads are combined automatically.

Execution steps:
- Initialization: each thread creates a local copy of `ave`, with initial value 0.
- Computation: each thread accumulates into its own private copy.
- Combination: after all threads finish, the private copies are summed automatically to give the final result.
Optimized by nowait¶
Barriers are really expensive. You need to understand when they are implied and how to skip them when it's safe to do so.
Here we need to consider a classic question:
What happens if there are many OpenMP work-sharing constructs, one after another, inside a single `omp parallel` region (all of them at the same level)?
Where are the implicit barriers?
- At the end of a `parallel` region there is an implicit barrier.
- At the end of a `for` construct there is an implicit barrier (unless the `nowait` clause is used).
- At the end of a `single` construct there is an implicit barrier (unless the `nowait` clause is used).
- At the end of a `sections` construct there is an implicit barrier (unless the `nowait` clause is used).
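The listing is missing; a sketch of how `nowait` removes the barrier between two independent work-shared loops (`work_a`, `work_b`, `a`, `b`, `n` are placeholders):

```c
#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < n; i++)
        a[i] = work_a(i);        /* no barrier here: threads move on immediately     */

    #pragma omp for              /* safe only because this loop does not depend on a */
    for (int i = 0; i < n; i++)
        b[i] = work_b(i);
}                                /* implicit barrier at the end of the parallel region */
```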
Data Sharing¶
Default storage attributes:
- Most variables are `shared` by default.
- Global variables are `shared` among threads.
- Stack variables in functions (C) called from parallel regions are `private` to each thread.
- Automatic variables within a statement block are `private`.
We can selectively change storage attributes for constructs using the following clauses:
- `shared(list)`
- `private(list)`
- `firstprivate(list)`
Private clause¶
`private(var)` creates a new local copy of `var` for each thread.
- The value of the private copies is uninitialized.
- The value of the original variable is unchanged after the region.
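A minimal sketch of the classic pitfall (the lecture's own listing is not preserved; `tmp` and the loop bound are illustrative):

```c
int tmp = 0;
#pragma omp parallel for private(tmp)   /* each thread gets its own, UNINITIALIZED tmp     */
for (int j = 0; j < 1000; j++)
    tmp += j;                           /* accumulates into the uninitialized private copy */
printf("%d\n", tmp);                    /* prints 0: the original tmp was never modified   */
```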
Obviously, this behaviour is not what we expect: the private copy starts out uninitialized. We need an improvement!
Hence the `firstprivate` clause.
Firstprivate clause¶
- Variables initialized from a shared variable
- C++ objects are copy-constructed
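The listing is missing; a sketch of the standard `firstprivate` example consistent with the sentence below (`MAX` and `A` are placeholders):

```c
int incr = 0;                               /* the global variable mentioned below */
#pragma omp parallel for firstprivate(incr)
for (int i = 0; i <= MAX; i++) {
    if ((i % 2) == 0) incr++;               /* each thread starts from incr == 0   */
    A[i] = incr;
}
```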
Each thread gets its own copy of `incr` (the global defined above) with an initial value of 0.
Take-away: use `firstprivate` when each thread needs a private copy that starts from the value of the shared variable.
Tasks¶
Single clause¶
The single construct denotes a block of code that is executed by only one thread (not necessarily the master thread).
A barrier is implied at the end of the single block (the barrier can be removed with a `nowait` clause).
In other words: the work is handed to just one thread, and everyone else waits until it is done.
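The listing is missing; a sketch of the canonical `single` example (`do_many_things`, `exchange_boundaries` and `do_many_other_things` are placeholders):

```c
#pragma omp parallel
{
    do_many_things();            /* executed by every thread                */
    #pragma omp single
    {
        exchange_boundaries();   /* executed by exactly one thread          */
    }                            /* implicit barrier here (unless nowait)   */
    do_many_other_things();      /* executed by every thread                */
}
```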
Task directive¶
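(Reconstructed directive form:)

```c
#pragma omp task
{ /* deferred work: executed later, by any thread of the team */ }
```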
This directive is used to specify a task to be executed by one of the threads in the team.
You can think of it as a "manager" or "controller" handing out pieces of work.
Example:
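The listing is missing; a sketch consistent with the explanation below (tasks `fred`, `daisy`, `billy`):

```c
#pragma omp parallel
{
    #pragma omp single          /* one thread creates all the tasks ...            */
    {
        #pragma omp task
        fred();
        #pragma omp task
        daisy();
        #pragma omp task
        billy();
    }                           /* ... all tasks complete at the barrier here      */
}
```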
Be careful
The task directive:
- `#pragma omp task` is used to create asynchronous tasks.
- A task may be executed by any thread in the team.
- The execution order of tasks is not deterministic.
- Tasks suit irregular parallel workloads.

The single directive:
- `#pragma omp single` ensures the block is executed by only one thread.
- The other threads wait at the end of the single region.
- It is typically used for initialization or for distributing tasks.
- It includes an implicit barrier (unless the `nowait` clause is used).

How the combination works:
- `parallel` creates the team of threads.
- `single` makes sure only one thread executes the task-creation code.
- The created tasks (fred, daisy, billy) can be executed asynchronously by any thread.
- Only after all tasks have completed can threads pass the barrier at the end of the single region.

This combination is well suited to dynamic and irregular parallel workloads.
taskwait directive and taskgroup region¶
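The first listing (taskwait) is missing; a reconstructed sketch:

```c
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp task
        fred();
        #pragma omp task
        daisy();
        #pragma omp taskwait    /* wait for fred and daisy to finish ... */
        #pragma omp task
        billy();                /* ... before billy is created           */
    }
}
```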
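The second listing (taskgroup) is missing; a reconstructed sketch:

```c
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            #pragma omp task
            fred();
            #pragma omp task
            daisy();
        }                       /* fred, daisy (and any child tasks) complete here */
        #pragma omp task
        billy();
    }
}
```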