
Lecture 4 Shared Memory Programming: Mostly OpenMP

Overview

As parallel computing has developed, the dominant machine models and their corresponding programming languages have kept changing:

  1. Vector Machine:
    • IVDEP
  2. SIMD Machine:
    • Data Parallel Language: SIMD ...
  3. Shared Memory Machine (SMPs):
    • Shared Memory Programming: OpenMP ...
  4. Cluster (HPC Model):
    • Message Passing (MPI) became dominant
  5. Additional Trends:
    • Accelerators: OpenACC, CUDA
    • Cloud Computing: Hadoop, SPARK

We will touch on each of these topics later in the course.

Recall: Shared Memory Model

  1. Program is a set of threads of control.
  2. Each thread has a set of private variables.
    • e.g. local stack variables
  3. Each thread also has a set of shared variables.
    • e.g. global heap, static variables

Two important statements:

  1. Threads communicate implicitly by writing and reading shared variables.
  2. Threads coordinate by synchronizing on shared variables.

Parallel Programming with Threads

POSIX Threads

  • POSIX: Portable Operating System Interface – Interface to Operating System utilities
  • PThreads: The POSIX threading interface – System calls to create and synchronize threads
POSIX

Here we introduce POSIX threads programming, focusing mainly on PThreads.

PThreads contain support for:

  1. Creating Parallelism
  2. Synchronizing
  3. No explicit support for communication
    • because shared memory is implicit; a pointer to shared data is passed to a thread
C
// format
int pthread_create(pthread_t *,
                   const pthread_attr_t *,
                   void * (*)(void *),
                   void *);

// usage
errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
  1. thread_id is the thread id or handle (used to halt, etc.)
  2. thread_attribute refers to various attributes
    • Default Values: a NULL pointer
    • Sample Attributes: minimum stack size, priority
  3. thread_fun: the function to be run
  4. fun_arg: an argument can be passed to thread_fun when it starts
  5. errcode will be set nonzero if the create operation fails (recall: nil in Golang)

Examples:

C
#include <stdio.h>
#include <pthread.h>

void* SayHello(void *foo) { 
    printf( "Hello, world!\n" ); 
    return NULL; 
}

int main() {
    pthread_t threads[16]; 
    int tn;

    for(tn=0; tn<16; tn++) {
        pthread_create(&threads[tn], NULL, SayHello, NULL);
    }

    for(tn=0; tn<16 ; tn++) {
        pthread_join(threads[tn], NULL);
    }

    return 0;
}

Compile using gcc -lpthread

Recall: Race Condition

A race condition or data race occurs when:

  • Two processors (or two threads) access the same variable, and at least one does a write.
  • The accesses are concurrent (not synchronized) so they could happen simultaneously.
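For example, here is a minimal sketch (the counter variable, thread count, and iteration count are made up for illustration) of a data race with PThreads: two threads increment the same shared counter without synchronization, so updates are lost and the final value is unpredictable.

C
#include <stdio.h>
#include <pthread.h>

long counter = 0;                     // shared variable (global/static storage)

void* add_many(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                    // read-modify-write is NOT atomic -> data race
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_many, NULL);
    pthread_create(&t2, NULL, add_many, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // Expected 2000000, but the printed value is usually smaller because updates are lost.
    printf("counter = %ld\n", counter);
    return 0;
}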

Basic Types of Synchronization: Mutexes

Mutexes, a.k.a. locks (mutual exclusion locks)

C
lock* l = alloc_and_init_lock(); // shared, define a lock
acquire(l); // lock now
// critical section:
// access shared variables
release(l); // release the lock

Locks only affect the processors currently using them.

Materials concerning 🔒

Lock Mechanism

Key Properties:

  • A lock only prevents access from other processors that explicitly try to acquire that specific lock
  • Simply holding a lock does not automatically prevent other threads from accessing the protected data
  • All code that accesses shared data must use the same lock for protection to be effective

Critical Implications

  1. Protection Requirements:

    • Every access to shared data must be guarded by acquiring the appropriate lock first
    • If a thread accesses shared data without acquiring the necessary lock, it can still modify the data regardless of whether other threads hold locks
  2. Implementation Details:

    • Modern processors use cache coherency protocols (like MESI) to implement atomic operations
    • The LOCK# signal was historically used to prevent other processors from accessing memory during critical operations
    • Hardware memory barriers ensure that lock operations are properly synchronized across multiple processors

Therefore, a good habit is: always acquire the correct lock before accessing shared resources.

There is a related concept called a semaphore, which is similar to a mutex but a little different (a sketch follows the list):

  1. Semaphores generalize locks to allow up to k threads simultaneous access to a resource.
  2. A mutex can only be unlocked by its owner (the thread that acquired it).
  3. A semaphore can be decremented by any process or thread that has access to it.
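As an illustration (a minimal sketch, not from the lecture; the names sem, K, and worker are assumptions), POSIX semaphores from <semaphore.h> let up to K threads hold the resource at once:

C
#include <pthread.h>
#include <semaphore.h>

#define K 3                    // at most K threads may use the resource at once

sem_t sem;                     // counting semaphore

void* worker(void *arg) {
    sem_wait(&sem);            // decrement; blocks when the count is already 0
    // ... use the shared resource (at most K threads are in here) ...
    sem_post(&sem);            // increment; wakes one waiting thread, if any
    return NULL;
}

int main() {
    sem_init(&sem, 0, K);      // initial count K; 0 = shared between threads, not processes
    // ... create and join worker threads as in the PThreads examples above ...
    sem_destroy(&sem);
    return 0;
}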

Syntax of Mutex

To create a mutex:

C
#include <pthread.h> 
pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER; 
// or pthread_mutex_init(&amutex, NULL);

To use it:

C
pthread_mutex_lock(&amutex);   // both calls return an int error code
pthread_mutex_unlock(&amutex);

To deallocate a mutex:

C
int pthread_mutex_destroy(pthread_mutex_t *mutex);
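Putting these calls together, here is a minimal sketch (the counter and add_many names are hypothetical) that fixes the earlier data race by protecting the shared counter with a mutex:

C
#include <stdio.h>
#include <pthread.h>

long counter = 0;
pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;   // statically initialized mutex

void* add_many(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&amutex);    // enter the critical section
        counter++;                      // protected update of the shared variable
        pthread_mutex_unlock(&amutex);  // leave the critical section
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_many, NULL);
    pthread_create(&t2, NULL, add_many, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter); // now reliably 2000000
    pthread_mutex_destroy(&amutex);
    return 0;
}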

Multiple mutexes may be held, but can lead to problems:

C
// Deadlock Scenario
thread1 lock(a) lock(b)

thread2 lock(b) lock(a)

Deadlock results if both threads acquire one of their locks, so that neither can acquire the second.

Why Deadlock Happens Here

Timing analysis (see the sketch after this list):

  1. Thread1 acquires lock a.
  2. Thread2 acquires lock b.
  3. Thread1 tries to acquire lock b (held by Thread2) and blocks.
  4. Thread2 tries to acquire lock a (held by Thread1) and blocks.
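A minimal sketch of the two thread bodies (the locks a and b are hypothetical), together with the standard fix: acquire the locks in one agreed global order.

C
#include <pthread.h>

pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

void* thread1(void *arg) {
    pthread_mutex_lock(&a);    // step 1
    pthread_mutex_lock(&b);    // step 3: waits for b, held by thread2 -> deadlock
    /* ... critical section ... */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

void* thread2(void *arg) {
    pthread_mutex_lock(&b);    // step 2
    pthread_mutex_lock(&a);    // step 4: waits for a, held by thread1 -> deadlock
    /* Fix: acquire in the same order as thread1 (a first, then b) and the cycle disappears. */
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}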

OpenMP

OpenMP = Open specification for Multi-Processing

(Figure: OpenMP cheat sheet; this picture is all you need when programming.)

(Figure: the OpenMP solution-stack architecture.)

Based on this architecture, you can just use the OpenMP syntax instead of worrying about the underlying code base.

Overview (highly abstracted)

C
// header
#include <omp.h>
C
// omp statement
#pragma omp construct [clause [clause]…]

// the construct applies to the structured block that follows; clauses modify its behavior
C
// omp region
#pragma omp parallel private(x) 
{

}

example:

C
#include <stdio.h>
#include <omp.h> // OpenMP include file

int main() 
{
    // Parallel region with default number of threads (4)
    #pragma omp parallel // region start
    {
        int ID = omp_get_thread_num(); // current thread ID
        printf("hello(%d)", ID);
        printf("world(%d)\n", ID);
    } // region end
}

Fork-Join Model:

(Figure: the fork-join execution model; a master thread forks a team of threads at a parallel region and joins them, with an implicit barrier, at the end of the region.)

Thread creation: Parallel regions

How to set the number of threads

C
omp_set_num_threads(4);
// runtime function

How to get current thread ID

C
int thID = omp_get_thread_num();
// runtime function

How to get total number of threads

C
int totalThNum = omp_get_num_threads();
// runtime function

Each thread executes a copy of the code within the structured block.


What's the real number of threads?
C
int totalThNum = omp_get_num_threads();
// runtime function

But is the number of threads requested the number you actually get?

  • NO! An implementation can silently decide to give you a team with fewer threads.
  • Once a team of threads is established, the system will not reduce the size of the team.
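A small sketch to check this: request a number of threads, then ask inside the region how many you actually got (the reported value may be smaller than the request).

C
#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(8);                 // request 8 threads
    #pragma omp parallel
    {
        #pragma omp single                  // one thread reports for the whole team
        printf("requested 8, got %d threads\n", omp_get_num_threads());
    }
    return 0;
}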

Time Calculation

C
double timeStamp = omp_get_wtime();

The timer's reference point is arbitrary, but it is fixed within a single program run, so differences between two calls are meaningful.

So we usually compute a duration like this:

C
double timeStamp_1 = omp_get_wtime();
...
double timeStamp_2 = omp_get_wtime();

double duration = timeStamp_2 - timeStamp_1;
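A minimal complete timing sketch (the array and problem size N are made up) that wraps a parallel loop with omp_get_wtime():

C
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    const long N = 10000000;                // hypothetical problem size
    double *A = malloc(N * sizeof(double));

    double t0 = omp_get_wtime();            // returns wall-clock seconds as a double
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        A[i] = 2.0 * i;
    double t1 = omp_get_wtime();

    printf("loop took %f seconds\n", t1 - t0);
    free(A);
    return 0;
}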

Shared Memory Hardware and Memory Consistency

Cache Coherence


  1. We use caches to reduce memory access time.
  2. With a write-back policy, a modified cache line is written back to memory only when it is evicted.
  3. In this scenario it can easily happen that one thread writes a value into its own cached copy of a line while another thread keeps reading a stale copy from its own cache.
  4. Keeping the caches consistent is the cache coherence problem, and the hardware solves it with a coherence protocol.


We ignore the detailed protocol here since it's not related to this class.

False Sharing

If independent data elements happen to sit on the same cache line, each update will cause the cache lines to "slosh back and forth" between threads ... This is called "false sharing".

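A common illustration (a sketch; the per-thread partial-sum array and the 64-byte line size are assumptions): each thread updates only its own array element, but neighboring elements share a cache line, so the line ping-pongs between caches. Padding each element to a full cache line avoids this.

C
#include <omp.h>

#define NTHREADS   4
#define CACHE_LINE 64                       // assumed cache-line size in bytes

// Bad: sum[0..3] sit on the same cache line -> false sharing when threads update them.
double sum[NTHREADS];

// Better: pad each thread's slot so it occupies its own cache line.
struct padded { double val; char pad[CACHE_LINE - sizeof(double)]; };
struct padded padded_sum[NTHREADS];

void accumulate(const double *x, long n) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        padded_sum[id].val = 0.0;
        #pragma omp for
        for (long i = 0; i < n; i++)
            padded_sum[id].val += x[i];     // each thread writes only its own cache line
    }
}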

Synchronization

High-level synchronization constructs included in the common core (the full OpenMP specification has MANY more):

Here we just focus on:

  • critical
  • barrier

Synchronization: critical

Mutual exclusion: Only one thread at a time can enter a critical region

Threads wait their turn – only one at a time calls consume()

C
#pragma omp parallel
{
    // ...
    #pragma omp critical
    {
        // critical region, eg:
        resume += consume(...);
    }
}

e.g. Suppose there are 4 threads (A / B / C / D) working on a loop.

Assume the speeds satisfy A > B > C = D.

resume is updated in order: first by A, then by B.

Since C and D arrive at the same time, they are serialized by critical (only one at a time may enter).

The critical construct acts as a scheduler: say it lets C go first, then D.

Synchronization: barrier

A "Stand-Alone" Pragma

No thread can pass the barrier line until all the threads have reached this point.

C
#pragma omp parallel
{
    // ...
    #pragma omp barrier // barrier line
    total = func(id);
}

e.g. Suppose there are 4 threads (A / B / C / D) working on a loop.

Assume the speeds satisfy A > B > C = D.

When A arrives, it must wait. When B arrives, it must wait. The same for C. When D arrives, all threads have reached this line, so they all go through.

Now all the threads are at the same point.

Loop Optimization

C
#pragma omp parallel
{
    #pragma omp for
    for (int i=0; i<N; i++) {
        // ...
    } // See Tips
} // You can think of this as an implicit barrier at the end of `parallel`.

Tips:

  1. All threads work on this loop together; different iterations (values of i) are assigned to different threads.
  2. The loop control index i is made "private" to each thread by default.
  3. Threads wait at the end of parallel until all threads are finished with the parallel loop before any proceed past the end of the loop.

You can also combine 2 pragmas into 1 for simplification:

C
// simplification:
#pragma omp parallel for
    for (int i=0; i<N; i++) {
        // ...
    }

PS: At the end of a #pragma omp parallel region there is an implicit barrier. It ensures that:

  • All threads complete their assigned loop iterations
  • No thread can prematurely exit the parallel region
  • Data consistency and correct execution order are maintained

Schedule Clause

Static Scheduling:

  • Schedule determined at compile time
  • Allocates fixed-size chunks of iterations to threads
  • Best for:
    • Predictable workload per iteration
    • Predetermined work distribution
    • Minimal runtime overhead since scheduling is done at compile time

Dynamic Scheduling:

  • Schedule determined at runtime
  • Threads grab chunks of iterations from a queue dynamically
  • Best for:
    • Unpredictable workload per iteration
    • Highly variable work distribution
    • Complex scheduling logic handled during execution
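A small sketch of both clauses (N, even_work, and uneven_work are hypothetical; the chunk size 4 is just an illustration): static hands out fixed chunks up front, while dynamic lets threads grab the next chunk as they finish.

C
#include <omp.h>

#define N 1000

void even_work(int i);      // hypothetical: roughly equal cost per iteration
void uneven_work(int i);    // hypothetical: cost varies a lot per iteration

void run(void) {
    // Fixed-size chunks assigned in advance: lowest scheduling overhead.
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < N; i++)
        even_work(i);

    // Threads pull chunks of 4 iterations from a shared queue as they become free.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++)
        uneven_work(i);
}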

Region Division in Loop

C
collapse(number)

The collapse(n) clause merges the outermost n loops into a single iteration space, which may provide better load balancing and parallel efficiency across the available threads.

But it is a double-edged sword.

Advantages

Increased Parallelism:

  • Enlarges iteration space, providing more parallel opportunities
  • Improves load balancing across threads
  • Particularly effective when outer loops are small and inner loops are large

Limitations

  • Code Structure Constraints:
    • No code can exist between collapsed loops
  • Excessive collapsing may:
    • Inhibit vectorization optimizations
    • Increase scheduling overhead
    • Reduce cache efficiency

The optimal collapse value should be determined through performance testing and careful consideration of the specific application requirements and hardware architecture.

How to choose number

One piece of advice: try different values starting from 1 and measure which one performs best :)

C
#pragma omp parallel for schedule(dynamic) collapse(2)
for (int i=0; i<B; i++) {
    for (int j=0; j<B; j++) {
        for (int k=0; k<B; k++) {
            SpGEMM(A(i,k), B(k,j), C(i,j));
        }
    }
}
  1. collapse(2): combine the outer 2 loops into one iteration space.
  2. schedule(dynamic): use dynamic scheduling.
  3. #pragma omp parallel for: distribute the loop iterations over multiple threads.

Reduction

A problem we have not yet addressed:

C
double ave=0.0, A[MAX]; 
int i;

for (i=0;i< MAX; i++) {
    ave += A[i];
} 

ave = ave / MAX;

There is a true dependence between loop iterations that can’t be trivially removed:

  1. We need to combine values into a single accumulative variable.
  2. If every thread updates the same variable, we need to ensure that the updates are done in a thread-safe manner.

Hence we need reduction!

C
// format
reduction(op:list)

Inside a parallel or a work-sharing construct:

  • A local copy of each list variable is made and initialized depending on the op (e.g. 0 for "+").
  • Updates occur on the local copy.
  • Local copies are reduced into a single value and combined with the original global value.

C
// correct answer
double ave=0.0, A[MAX]; 
int i;

#pragma omp parallel for reduction(+:ave)
for (i=0;i< MAX; i++) {
    ave += A[i];
} 
ave = ave / MAX;

Basic syntax:

  • reduction(+:ave) performs an addition reduction on the variable ave
  • Each thread creates a private copy of ave
  • At the end of the parallel region the results of all threads are combined automatically

Execution process:

  1. Initialization: each thread creates a local copy of ave with initial value 0
  2. Computation: each thread accumulates into its own private copy
  3. Combination: after all threads finish, the private copies are automatically summed to obtain the final result

Optimized by nowait

Barriers are really expensive. You need to understand when they are implied and how to skip them when it's safe to do so.


Here we need to consider a common situation:

What happens if there are several OpenMP work-sharing constructs inside one omp parallel region (all of them at the same level)?

Where are implicit barriers?

  • There is an implicit barrier at the end of a parallel region
  • There is an implicit barrier at the end of a for construct (unless the nowait clause is used)
  • There is an implicit barrier at the end of a single construct (unless the nowait clause is used)
  • There is an implicit barrier at the end of a sections construct (unless the nowait clause is used)
C
#pragma omp parallel
{
    #pragma omp for        // barrier 1
    for(...) { }

    #pragma omp sections   // barrier 2
    {
        #pragma omp section
        { }
    }

    #pragma omp single     // barrier 3
    { }
}                         // barrier 4 (implicit barrier at the end of the parallel region)
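A sketch of skipping a barrier that is safe to skip (the arrays A and B are hypothetical and independent): the second loop does not read what the first loop writes, so the barrier between them can be removed with nowait.

C
#include <omp.h>

#define N 1000
double A[N], B[N];

void fill(void) {
    #pragma omp parallel
    {
        #pragma omp for nowait          // no barrier here: the next loop touches different data
        for (int i = 0; i < N; i++)
            A[i] = 1.0 * i;

        #pragma omp for                 // implicit barrier at the end of this loop
        for (int i = 0; i < N; i++)
            B[i] = 2.0 * i;
    }                                   // implicit barrier at the end of parallel (cannot be removed)
}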

Data Sharing

Default storage attributes:

  1. Most variables are shared by default
  2. Global variables: shared among threads
  3. Stack variables in functions (C) called from parallel regions are private to each thread.
  4. Automatic variables within a statement block are private.

We can selectively change storage attributes for constructs using the following clauses:

  • shared(list)
  • private(list)
  • firstprivate(list)

Private clause

private(var) creates a new local copy of var for each thread.

  • The value of the private copies is uninitialized.
  • The value of the original variable is unchanged after the region.
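A minimal sketch of the pitfall (the variable name tmp is made up): inside the region each thread gets a new, uninitialized private tmp, and the original tmp is untouched afterwards.

C
#include <stdio.h>
#include <omp.h>

int main() {
    int tmp = 100;                        // original (shared) value
    #pragma omp parallel private(tmp)
    {
        // tmp here is a new per-thread copy that starts UNINITIALIZED: reading it first is undefined.
        tmp = omp_get_thread_num();       // must assign before use
        printf("thread %d sees tmp = %d\n", omp_get_thread_num(), tmp);
    }
    printf("after the region: tmp = %d\n", tmp);   // still 100
    return 0;
}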

Obviously, this behavior (an uninitialized private copy) is often not what we want, so we need an improvement.

Hence the firstprivate clause.

Firstprivate clause

  • Each private copy is initialized from the value of the shared variable
  • C++ objects are copy-constructed
C
incr = 0; // global variable

#pragma omp parallel for firstprivate(incr) 
for (i = 0; i < MAX; i++) {
    if ((i%2)==0) incr++;
    A[i] = incr; 
}

Each thread gets its own copy of incr (the global defined above) with an initial value of 0.

Take-away: choose the appropriate data-sharing clause (shared, private, firstprivate) for each variable used in a parallel region.

Tasks


Single clause

The single construct denotes a block of code that is executed by only one thread (not necessarily the master thread).

A barrier is implied at the end of the single block (can remove the barrier with a nowait clause).

The block is assigned to just one thread; once it finishes, everyone moves on.

C
#pragma omp parallel 
{

    do_many_things(); 

    #pragma omp single
    {
        exchange_boundaries();
    }

    do_many_other_things();
}

Task directive

C
#pragma omp task [clauses]

This directive specifies a task: a unit of work that may be executed, possibly later, by any one of the threads in the team.

You can think of the task-generating thread as a "manager" or "controller" handing out work.

Example:

C
#pragma omp parallel // Create some threads to be ready
{ 

    #pragma omp single // Only One Thread (A, boss) need to assign these tasks to all the teammates
    { 
        #pragma omp task // Tasks executed by some thread in some order
            fred(); // A: I am boss, this task is assigned to B
        #pragma omp task 
            daisy(); // A: I am boss, this task is assigned to C
        #pragma omp task 
            billy(); // A: I am boss, this task is assigned to D
    }
} // All task assignments are complete before this barrier is released
Be careful

The task directive:

  • #pragma omp task is used to create asynchronous tasks
  • A task may be executed by any thread in the team
  • The execution order of tasks is not determined
  • Well suited to irregular parallel workloads

The single directive:

  • #pragma omp single ensures the code block is executed by only one thread
  • The other threads wait at the end of the single region
  • Commonly used for initialization or for distributing tasks
  • Includes an implicit barrier (unless the nowait clause is used)

How they work together

  • parallel creates the team of threads
  • single ensures that only one thread performs the task creation
  • The created tasks (fred, daisy, billy) can be executed asynchronously by any thread
  • All tasks must complete before the threads pass the barrier at the end of the single region

This combination is well suited to dynamic and irregular parallel workloads.

taskwait directive and taskgroup region


C
#pragma omp parallel 
{

    #pragma omp single 
    { 
        #pragma omp task 
            fred(); 
        #pragma omp task 
            daisy(); 
        #pragma omp taskwait // fred() and daisy() must complete before billy() starts
        #pragma omp task 
            billy();
    }
}


C
#pragma omp parallel 
{
    #pragma omp single 
    { 
        #pragma omp taskgroup  // start task group
        {
            #pragma omp task // child 1
                fred(); 
            #pragma omp task // child 2
                daisy(); 
        } // end task group, need both 1 and 2 to complete before billy() starts

        #pragma omp task 
            billy();
    }
}