
Lecture 7 Introduction to Graphics Processing Units (GPUs)

Overview of CPU and GPU

  • CPU goal: minimum latency
  • GPU goal: maximum throughput


Deal with Stalls

As is well known, stalls are the enemy of performance.

  1. Stall: a core cannot run the next instruction because it depends on a previous one that has not yet finished
  2. Memory operations take 100s-1000s of cycles
  3. GPUs remove the fancy caches and prefetch logic that CPUs use to avoid stalls

So the GPU uses a huge number of threads to hide these stalls and keep its execution units busy.


GPUs in Practice

We need to pay attention to the SM (Streaming Multiprocessor).


Programming GPUs

Heterogeneous Programming (CPU + GPU)


Scenario: only the CPU does the work


Running GPU Code (Kernel)

You can think of the CPU as a boss and the GPU as a high-throughput worker.

So to write CPU + GPU code, we take the CPU's perspective; the process is:

  1. Allocate memory on GPU
  2. Copy data to GPU
  3. Execute GPU program
  4. Wait to complete
  5. Copy results back to CPU

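A minimal host-side sketch of these five steps (assuming the add kernel shown later, host arrays x and y of length N, and device pointers named d_x and d_y as in the lecture):

int N = 1 << 20;
size_t bytes = N * sizeof(float);

// 1. Allocate memory on the GPU
float *d_x, *d_y;
cudaMalloc(&d_x, bytes);
cudaMalloc(&d_y, bytes);

// 2. Copy the input data from the CPU to the GPU
cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

// 3. Execute the GPU program (kernel launch)
add<<<1, 1>>>(N, d_x, d_y);

// 4. Wait for the kernel to complete
cudaDeviceSynchronize();

// 5. Copy the results back to the CPU
cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);

cudaFree(d_x);
cudaFree(d_y);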

add<<<1,1>>>(N, d_x, d_y);

You may be confused by this <<<1,1>>>.

  1. The first 1 means the number of blocks
  2. The second 1 means the number of threads per block

We will return to it shortly.

How to Write GPU Code

// Run kernel on GPU
add<<<1,256>>>(N, d_x, d_y);

The number of threads per block (256 here) must be <= 1024 on Volta (earlier architectures allowed at most 512).

// GPU function to add two vectors 
__global__ 
void add(int n, float *x, float *y) 
{ 
    int index = threadIdx.x;
    int stride = blockDim.x; // number of threads in this block
    for (int i = index; i < n; i+=stride)
        y[i] = x[i] + y[i]; 
    // Works for arbitrary N and # threads / block, only one block
}

GPU Programming Model

Grids and Thread Blocks

Grid > Thread Block > Thread

1D grid of 1D blocks


2D grid of 1D blocks


2D grid of 2D blocks

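As an illustration (a sketch, not from the lecture; the kernel name scale2d, the pointer d_img, and the dimensions width and height are hypothetical), a 2D grid of 2D blocks is launched with dim3 parameters, and each thread recovers its 2D coordinates from blockIdx, blockDim, and threadIdx:

__global__ void scale2d(float *img, int width, int height)
{
    // 2D coordinates of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] *= 2.0f;   // per-element work
}

// Host side: 16x16 threads per block, enough blocks to cover the image
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
scale2d<<<grid, block>>>(d_img, width, height);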

Take-Away

  1. Grid is a collection of threads.
  2. Threads in a grid execute a kernel function and are divided into thread blocks.
  3. gridDim gives the dimensions of the grid, i.e. how many blocks were launched in each dimension, as specified at kernel launch (see the sketch below).
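Building on the earlier single-block add kernel, a common pattern (a sketch, not taken verbatim from the lecture) uses blockIdx and gridDim so the same kernel works with any number of blocks:

__global__
void add(int n, float *x, float *y)
{
    // Global index of this thread across all blocks
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of threads in the whole grid
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Launch with enough blocks to cover N elements, 256 threads per block
int numBlocks = (N + 255) / 256;
add<<<numBlocks, 256>>>(N, d_x, d_y);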


Sharing and Synchronization

Shared Memory on Nvidia GPUs

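Each thread block has a small, fast on-chip shared memory region that all of its threads can read and write; it is declared with __shared__. A minimal sketch (illustrative only; the kernel name reverse_blocks is hypothetical, the block size is assumed to be 256, and the array length a multiple of it):

__global__ void reverse_blocks(const float *in, float *out)
{
    __shared__ float tile[256];                  // one tile per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                   // each thread stages one element
    __syncthreads();                             // wait until the whole tile is loaded
    out[i] = tile[blockDim.x - 1 - threadIdx.x]; // reverse within the block
}

The __syncthreads() barrier used here is the topic of the next subsection.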

Thread Synchronization

Synchronizes all threads within a block

void __syncthreads();

  • Used to prevent RAW / WAR / WAW hazards
  • All threads in the block must reach the barrier
  • If used inside a conditional, the condition must be uniform across the block

Review

  • RAW: Read After Write
  • WAR: Write After Read
  • WAW: Write After Write

__global__ void stencil_1d(int *in, int *out) {
    // Shared tile: BLOCK_SIZE elements plus a RADIUS halo on each side
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements (and the halos) into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    // Store the result
    out[gindex] = result;
}

Threads within a block may synchronize with barriers

// Step 1
__syncthreads();
// Step 2

Implicit barrier between kernels

vec_minus<<<nblocks, blksize>>>(a, b, c); 
// ------------implicit barrier---------------
vec_dot<<<nblocks, blksize>>>(c, c);

CUDA

Extensions to C/C++

  • Declaration specifiers to indicate where things live
    __global__ void KernelFunc(...); // kernel callable from host
    __device__ void DeviceFunc(...); // function callable on device 
    __device__ int GlobalVar;  // variable in device memory
    __shared__ int SharedVar; // in per-block shared memory
    
  • Extend function invocation syntax for parallel kernel launch
    KernelFunc<<<500, 128>>>(...); // 500 blocks, 128 threads each
    
  • Special variables for thread identification in kernels
    dim3 threadIdx;
    dim3 blockIdx;
    dim3 blockDim;
    
  • Intrinsics that expose specific operations in kernel code
    __syncthreads(); // barrier synchronization
    

Features available on GPU

  • Double and single precision
  • Standard mathematical functions – sinf, powf, atanf, ceil, min, sqrtf, etc.
  • Atomic memory operations – atomicAdd, atomicMin, atomicAnd, atomicCAS, etc.
  • These work on both global and shared memory (see the sketch below)
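For example, a minimal sketch (illustrative, not from the lecture; the kernel name count_positive and its arguments are hypothetical) of using atomicAdd so that many threads can safely increment one counter in global memory:

__global__ void count_positive(const float *x, int n, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] > 0.0f)
        atomicAdd(count, 1);   // safe read-modify-write on global memory
}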

Runtime support

  • Explicit memory allocation returns pointers to GPU memory – cudaMalloc(), cudaFree(), cudaMallocManaged()
  • Explicit memory copy for host ↔ device and device ↔ device – cudaMemcpy(), cudaMemcpy2D(), ...
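As a sketch (assuming the same add kernel and vector length N as before), cudaMallocManaged() returns memory that both the CPU and the GPU can access, so the explicit copies from the earlier five-step recipe disappear:

float *x, *y;
cudaMallocManaged(&x, N * sizeof(float)); // visible to both host and device
cudaMallocManaged(&y, N * sizeof(float));

// ... initialize x and y on the CPU ...

add<<<1, 256>>>(N, x, y);    // the kernel uses the same pointers
cudaDeviceSynchronize();     // wait before the CPU reads the results

cudaFree(x);
cudaFree(y);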

Summary

Here we organize the whole GPU execution hierarchy.

Threads:

  • Each thread is a SIMD lane (ALU)

Warps:

  • A warp is executed as a logical SIMD instruction (sort of)
  • Warp width is 32 elements: the LOGICAL SIMD width
  • (Warp-level programming is also possible)

Thread blocks:

  • Each thread block is scheduled onto an SM
  • Peak efficiency requires multiple thread blocks per processor

Kernel:

  • A kernel executes on a GPU (there is also multi-GPU programming)

SM and Thread Block (TB)

Basic Architectural Relationship

Definition of the SM

  • The SM is the basic processing unit of an NVIDIA GPU: a general-purpose processor with a relatively low clock frequency and small caches
  • Each GPU contains multiple SMs; for example, the NVIDIA A100 GPU has 108 SMs

Execution Relationship

  • A single SM can execute multiple thread blocks at the same time
  • Once a thread block has been assigned to an SM, it must finish execution on that SM and cannot migrate to another SM

Resource Allocation

Hardware Resources of an SM

  • Execution cores (single-precision floating-point units, double-precision floating-point units, etc.)
  • Multi-level caches (L1 cache, shared memory, constant cache, texture cache)
  • Warp schedulers
  • A large number of registers

Scheduling Mechanism

  • The GPU scheduler assigns thread blocks to the available SMs
  • As soon as an SM finishes executing one thread block, it moves on to the next
  • To fully utilize the GPU, the number of thread blocks launched should typically be at least 4x the number of SMs (see the sketch below)
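A minimal sketch of sizing a launch by this rule of thumb (illustrative; it reuses the grid-stride add kernel shown earlier, which tolerates extra threads, and the 4x factor comes from the guideline above):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);            // properties of GPU 0

int threadsPerBlock = 256;
int blocksForData = (N + threadsPerBlock - 1) / threadsPerBlock;
int minBlocks = 4 * prop.multiProcessorCount; // at least 4 blocks per SM
int numBlocks = (blocksForData > minBlocks) ? blocksForData : minBlocks;

add<<<numBlocks, threadsPerBlock>>>(N, d_x, d_y);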

Performance Considerations

Parallel Execution

  • All threads in a thread block execute concurrently once the block has been assigned to an SM
  • The SM manages and executes threads in warps (groups of 32 threads)
  • The SM releases a block's resources only after all threads in that block have finished