Lecture 7: Introduction to Graphics Processing Units (GPUs)
Overview: CPU vs. GPU
- CPU goal: minimize latency
- GPU goal: maximize throughput
Dealing with Stalls
Stalls are the enemy of performance.
- A stall occurs when a core cannot run the next instruction because of a dependency on a previous one
- Memory operations take 100s-1000s of cycles
- GPUs remove the fancy caches and prefetch logic that CPUs use to avoid stalls
Instead, a GPU keeps a huge number of threads in flight and hides stalls by switching among them.
GPUs in Practice
The key hardware unit to understand is the SM (Streaming Multiprocessor).
Programming GPUs
Heterogeneous Programming (CPU + GPU)
By default, only the CPU does any work; to use the GPU, the CPU must launch GPU code, called a kernel.
You can think of the CPU as a boss and the GPU as a high-throughput worker.
To write CPU + GPU code, we take the CPU's perspective. The process is (a code sketch follows the list):
- Allocate memory on GPU
- Copy data to GPU
- Execute GPU program
- Wait to complete
- Copy results back to CPU
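The code listing here did not survive extraction; below is a minimal sketch of the five steps, assuming an illustrative kernel `scale` that doubles each element (all names and sizes are assumptions, not the original listing):

```c
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: with <<<1,1>>> a single GPU thread handles every element.
__global__ void scale(float *data, int n) {
    for (int i = 0; i < n; i++)
        data[i] *= 2.0f;
}

int main(void) {
    const int n = 16;
    float host[16];
    for (int i = 0; i < n; i++) host[i] = (float)i;

    float *dev;
    cudaMalloc((void **)&dev, n * sizeof(float));        // 1. allocate memory on GPU
    cudaMemcpy(dev, host, n * sizeof(float),
               cudaMemcpyHostToDevice);                  // 2. copy data to GPU
    scale<<<1, 1>>>(dev, n);                             // 3. execute GPU program
    cudaDeviceSynchronize();                             // 4. wait for completion
    cudaMemcpy(host, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);                  // 5. copy results back to CPU
    cudaFree(dev);

    printf("host[1] = %f\n", host[1]);                   // 2.000000
    return 0;
}
```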
The `<<<1,1>>>` in the launch configuration may be confusing at first:
- The first 1 is the number of thread blocks
- The second 1 is the number of threads per block
We will return to these parameters shortly.
How to Write GPU Code
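The original two-line listing is lost; a plausible sketch of a launch using 256 threads per block, assuming an `add` kernel over `N` elements (both names are illustrative):

```c
int threadsPerBlock = 256;  // round the block count up so all N elements are covered
add<<<(N + threadsPerBlock - 1) / threadsPerBlock, threadsPerBlock>>>(N, x, y);
```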
The number of threads per block (256 here) must be <= 1024 on Volta (the limit was 512 on earlier architectures).
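The ten-line listing that followed is also not recoverable; a sketch of a complete kernel in the spirit of this section, using a grid-stride loop so the launch works for any `N` (kernel and variable names are assumptions):

```c
// Each thread starts at its global index and strides by the total number of
// threads in the grid, so every element is covered regardless of launch size.
__global__ void add(int n, float *x, float *y) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int stride = blockDim.x * gridDim.x;                 // total threads in grid
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
```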
GPU Programming Model
Grids and Thread Blocks
Grid > Thread Block > Thread
1D Grid, 1D Blocks
2D Grid, 1D Blocks
2D Grid, 2D Blocks
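These headings refer to the dimensionality of the grid and of its blocks; the original illustrations are not preserved. As a sketch, the usual global-index computations for each layout:

```c
// 1D grid of 1D blocks
int idx1 = blockIdx.x * blockDim.x + threadIdx.x;

// 2D grid of 1D blocks: linearize the block coordinate first
int blockId = blockIdx.y * gridDim.x + blockIdx.x;
int idx2    = blockId * blockDim.x + threadIdx.x;

// 2D grid of 2D blocks: global (x, y) coordinates, e.g. for an image
int gx = blockIdx.x * blockDim.x + threadIdx.x;
int gy = blockIdx.y * blockDim.y + threadIdx.y;
```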
Take-Away
- Grid is a collection of threads.
- Threads in a grid execute a kernel function and are divided into thread blocks.
- `gridDim`: the total number of blocks launched by this kernel invocation, as specified in the launch configuration.
Sharing and Synchronization
Shared Memory on Nvidia GPUs
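The body of this subsection did not survive extraction. In brief, shared memory is fast on-chip memory private to each thread block, declared with `__shared__` (array size illustrative):

```c
__shared__ float tile[256];  // one copy per thread block, visible to all its threads
```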
Thread Synchronization
`__syncthreads()` synchronizes all threads within a block:
```c
__syncthreads();
```
- Used to prevent RAW/WAR/WAW hazards
- All threads in the block must reach the barrier
- If used inside a conditional, the condition must be uniform across the block
Review
- RAW: Read After Write
- WAR: Write After Read
- WAW: Write After Write
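The listing that followed is lost; a sketch of the kind of pattern the barrier exists for, assuming a block-local array reversal through shared memory (names illustrative):

```c
// Launch with one block, e.g. reverseBlock<<<1, 256>>>(data);
__global__ void reverseBlock(float *data) {
    __shared__ float tmp[256];
    int t = threadIdx.x;

    tmp[t] = data[t];                   // every thread writes one element
    __syncthreads();                    // barrier: all writes finish before any read
    data[t] = tmp[blockDim.x - 1 - t];  // read another thread's element (RAW-safe)
}
```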
Threads within a block may synchronize with barriers
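The original snippet here is lost; presumably a two-phase pattern like:

```c
// phase 1: every thread writes its part of shared memory
__syncthreads();  // barrier: no thread proceeds until all writes are done
// phase 2: every thread may now read data written by the others
```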
Implicit barrier between kernels
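This snippet is lost as well; a sketch with illustrative kernel names, relying on the fact that kernel launches on the same (default) stream execute in order:

```c
vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_add<<<nblocks, blksize>>>(c, c, d);  // starts only after vec_minus completes
```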
CUDA
Extensions to C/C++
- Declaration specifiers to indicate where things live
```c
__global__ void KernelFunc(...);  // kernel callable from host
__device__ void DeviceFunc(...);  // function callable on device
__device__ int GlobalVar;         // variable in device memory
__shared__ int SharedVar;         // in per-block shared memory
```
- Extend function invocation syntax for parallel kernel launch
```c
KernelFunc<<<500, 128>>>(...);  // 500 blocks, 128 threads each
```
- Special variables for thread identification in kernels
```c
dim3 threadIdx;
dim3 blockIdx;
dim3 blockDim;
```
- Intrinsics that expose specific operations in kernel code
```c
__syncthreads();  // barrier synchronization
```
Features available on GPU
- Double and single precision
- Standard mathematical functions: `sinf`, `powf`, `atanf`, `ceil`, `min`, `sqrtf`, etc.
- Atomic memory operations: `atomicAdd`, `atomicMin`, `atomicAnd`, `atomicCAS`, etc.
- These work on both global and shared memory (see the sketch below)
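As a sketch of atomics in action, a hypothetical histogram kernel where many threads may increment the same bin concurrently (names are illustrative):

```c
__global__ void histogram(const int *values, int n, int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[values[i]], 1);  // safe even when threads collide on a bin
}
```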
Runtime support
- Explicit memory allocation returns pointers to GPU memory: `cudaMalloc()`, `cudaFree()`, `cudaMallocManaged()`
- Explicit memory copy for host ↔ device and device ↔ device: `cudaMemcpy()`, `cudaMemcpy2D()`, ...
Summary
Here we organize the whole GPU execution hierarchy in one place.
Threads:
– Each thread is a SIMD lane (ALU)
Warps:
– A warp executes as a logical SIMD instruction (sort of)
– Warp width is 32 elements: the LOGICAL SIMD width
– (Warp-level programming is also possible)
Thread blocks:
– Each thread block is scheduled onto an SM
– Peak efficiency requires multiple thread blocks per processor
Kernel:
– A kernel executes on a GPU (there is also multi-GPU programming)
SM and Thread Blocks (TB)
Basic architectural relationship
Definition of an SM
- The SM is the basic processing unit of an NVIDIA GPU: a general-purpose processor with a relatively low clock frequency and relatively small caches
- Every GPU contains multiple SMs; the NVIDIA A100, for example, has 108 SMs
Execution relationship
- A single SM can execute multiple thread blocks concurrently
- Once a thread block has been assigned to an SM, it must run to completion on that SM; it cannot migrate to another SM
Resource allocation
Hardware resources of an SM
- Execution cores (single-precision and double-precision floating-point units, etc.)
- Multi-level caches (L1 cache, shared memory, constant cache, texture cache)
- Warp schedulers
- A large register file
Scheduling
- The GPU scheduler assigns thread blocks to available SMs
- As soon as an SM finishes one thread block, it moves on to the next
- To fully utilize the GPU, the number of thread blocks launched should usually be at least 4x the number of SMs, as in the sketch below
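A sketch of applying that rule of thumb by querying the SM count at runtime; the kernel launch is commented out and its name is illustrative:

```c
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0
    int numSMs = prop.multiProcessorCount;  // e.g. 108 on an A100

    // Rule of thumb: launch at least 4x as many thread blocks as SMs
    int numBlocks = 4 * numSMs;
    // myKernel<<<numBlocks, 256>>>(...);   // illustrative launch
    return 0;
}
```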
Performance considerations
Parallel execution
- Once a thread block is assigned to an SM, all of its threads execute there concurrently
- The SM manages and executes threads in warps (groups of 32 threads)
- The SM releases a block's resources only after all of its threads have finished