Lecture 7 Introduction to Graphics Processing Units (GPUs)¶
Overview of CPU and GPU¶
- CPU goal: minimum latency
- GPU goal: maximum throughput




Dealing with Stalls¶
Stalls are the enemy of performance.
- Stall: a core cannot run the next instruction because of a dependency on a previous one
- Memory operations take hundreds to thousands of cycles
- GPUs remove the fancy caches and prefetch logic that help avoid stalls
Instead, a GPU relies on a huge number of threads to hide latency and keep its execution units busy.

GPUs in Practice¶
We need to pay attention to the SM (Streaming Multiprocessor), the basic execution unit of the GPU.

Programming GPUs¶
Heterogeneous Programming (CPU + GPU)¶

Scenario: only the CPU does the work

Running GPU Code (Kernel)
You can think of the CPU as a boss and the GPU as a high-throughput worker.
To write CPU + GPU code we take the CPU's perspective; the process is as follows (a sketch of this workflow appears after the list):
- Allocate memory on GPU
- Copy data to GPU
- Execute GPU program
- Wait to complete
- Copy results back to CPU
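A minimal sketch of that workflow, with illustrative kernel and variable names (not from the original lecture):

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void AddOne(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;  // each thread updates one element
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float* h_x = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) h_x[i] = (float)i;

    float* d_x;
    cudaMalloc(&d_x, bytes);                              // 1. allocate memory on GPU
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // 2. copy data to GPU
    AddOne<<<(n + 255) / 256, 256>>>(d_x, n);             // 3. execute GPU program
    cudaDeviceSynchronize();                              // 4. wait to complete
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // 5. copy results back to CPU

    printf("h_x[0] = %f\n", h_x[0]);  // expect 1.000000
    cudaFree(d_x);
    free(h_x);
    return 0;
}
```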

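The snippet that belonged here was lost in extraction; a launch of this shape, with a hypothetical kernel name, looks like:

```c
#include <cstdio>

__global__ void HelloKernel() {
    printf("Hello from the GPU!\n");  // device-side printf
}

int main() {
    HelloKernel<<<1, 1>>>();   // launch with 1 block of 1 thread
    cudaDeviceSynchronize();   // wait so the output appears before exit
    return 0;
}
```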
The `<<<1,1>>>` syntax may be confusing at first:
- The first 1 is the number of blocks
- The second 1 is the number of threads per block
We will return to these concepts below.
How to Write GPU Code¶
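The original two-line snippet was lost; based on the note below, it launched a kernel with 256 threads per block, roughly like this (kernel and variable names are hypothetical):

```c
int numBlocks = (N + 255) / 256;          // round up so all N elements are covered
MyKernel<<<numBlocks, 256>>>(d_data, N);  // 256 threads per block
```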
The number of threads per block (256 here) must be at most 1024 on modern architectures such as Volta (older architectures allowed only 512).
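The original ten-line example was also lost; a typical kernel of that size at this point in a GPU lecture is vector addition, sketched here as an assumption rather than a reconstruction:

```c
__global__ void VecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

// Example launch: 256-thread blocks, enough blocks to cover n elements
// VecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```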
GPU Programming Model¶
Grids and Thread Blocks
Grid > Thread Block > Thread
1D Grid, 1D Blocks¶

2D Grid, 1D Blocks¶

2D Grid, 2D Blocks¶
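The figures for these three layouts are not reproduced here; the corresponding global-index computations (standard CUDA formulas, written inside a kernel) are:

```c
// 1D grid of 1D blocks
int i = blockIdx.x * blockDim.x + threadIdx.x;

// 2D grid of 1D blocks
int blockId1 = blockIdx.y * gridDim.x + blockIdx.x;
int j = blockId1 * blockDim.x + threadIdx.x;

// 2D grid of 2D blocks
int blockId2 = blockIdx.y * gridDim.x + blockIdx.x;
int inBlock  = threadIdx.y * blockDim.x + threadIdx.x;
int k = blockId2 * (blockDim.x * blockDim.y) + inBlock;
```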

Take-Away¶
- A grid is the full collection of threads launched by one kernel invocation.
- Threads in a grid execute the same kernel function and are divided into thread blocks.
- `gridDim`: the dimensions of the grid, i.e., the number of blocks launched in each dimension, as specified at kernel launch.

Sharing and Synchronization¶
Shared Memory on Nvidia GPUs¶
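The accompanying figures are not reproduced here. As a minimal sketch (kernel and sizes are illustrative): shared memory is a small on-chip memory declared with `__shared__` and visible to every thread of a block:

```c
__global__ void ReverseBlock(const float* in, float* out) {
    __shared__ float tile[256];   // on-chip, shared by the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];    // stage global memory into shared memory
    __syncthreads();              // make all writes visible to all threads
    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // reverse within the block
}
// Assumes a launch with blockDim.x == 256 and an input size that is a multiple of 256.
```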



Thread Synchronization¶
`__syncthreads()` synchronizes all threads within a block:
```c
__syncthreads();
```
- Used to prevent RAW/WAR/WAW hazards
- All threads in the block must reach the barrier
- If used inside a conditional, the condition must be uniform across the block
Review
- RAW: Read After Write
- WAR: Write After Read
- WAW: Write After Write
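The ten-line example that belonged here was lost in extraction. A common example at this point is a shared-memory kernel where `__syncthreads()` separates the write phase from the read phase; the following is an assumed stand-in, not the original:

```c
__global__ void ShiftLeft(const float* in, float* out, int n) {
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[threadIdx.x] = in[i];   // each thread writes one slot
    __syncthreads();                // barrier: prevents the RAW hazard below
    // Each thread reads its neighbor's slot; without the barrier this read
    // could execute before the neighbor's write completed.
    if (i < n && threadIdx.x > 0)
        out[i] = buf[threadIdx.x - 1];
}
```

Note that the barrier sits outside the conditionals, so every thread in the block reaches it.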
Threads within a block may synchronize with barriers
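The three-line snippet here was lost; its shape was presumably:

```c
// ... phase 1: threads write shared data ...
__syncthreads();   // every thread of the block waits here
// ... phase 2: threads read what other threads wrote ...
```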
Implicit barrier between kernels
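This snippet was lost as well. The point is that two kernel launches on the same stream never overlap, so no explicit synchronization is needed between them (kernel names are illustrative):

```c
Step1<<<numBlocks, 256>>>(d_data);  // first kernel
Step2<<<numBlocks, 256>>>(d_data);  // runs only after Step1 finishes (same stream)
// The implicit barrier between the launches replaces an explicit wait.
```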
CUDA¶
Extensions to C/C++
- Declaration specifiers to indicate where things live

```c
__global__ void KernelFunc(...);  // kernel callable from host
__device__ void DeviceFunc(...);  // function callable on device
__device__ int GlobalVar;         // variable in device memory
__shared__ int SharedVar;         // in per-block shared memory
```

- Extended function invocation syntax for parallel kernel launch

```c
KernelFunc<<<500, 128>>>(...);  // 500 blocks, 128 threads each
```

- Special variables for thread identification in kernels

```c
dim3 threadIdx; dim3 blockIdx; dim3 blockDim;
```

- Intrinsics that expose specific operations in kernel code

```c
__syncthreads();  // barrier synchronization
```
Features available on GPU
- Double and single precision
- Standard mathematical functions: `sinf`, `powf`, `atanf`, `ceil`, `min`, `sqrtf`, etc.
- Atomic memory operations: `atomicAdd`, `atomicMin`, `atomicAnd`, `atomicCAS`, etc.
- These work on both global and shared memory
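For example, a histogram kernel (an illustrative sketch, not from the lecture) relies on `atomicAdd` so that concurrent increments to the same bin are not lost:

```c
__global__ void Histogram(const int* data, int n, int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1);  // read-modify-write as one indivisible step
}
```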
Runtime support
- Explicit memory allocation returns pointers to GPU memory: `cudaMalloc()`, `cudaFree()`, `cudaMallocManaged()`
- Explicit memory copy for host ↔ device and device ↔ device: `cudaMemcpy()`, `cudaMemcpy2D()`, ...
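A small sketch of the managed-memory variant, which removes the explicit copies (assumes a unified-memory-capable GPU; `n` is illustrative):

```c
float* x;
cudaMallocManaged(&x, n * sizeof(float));  // one pointer, valid on both CPU and GPU
// ... initialize x on the host, launch kernels that use x on the device ...
cudaDeviceSynchronize();                   // make GPU writes visible to the CPU
cudaFree(x);
```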
Summary¶
We organize the whole GPU execution hierarchy here.
Threads:
- Each thread is a SIMD lane (ALU)
Warps:
- A warp is executed as a logical SIMD instruction (sort of)
- Warp width is 32 elements: the LOGICAL SIMD width
- (Warp-level programming is also possible)
Thread blocks:
- Each thread block is scheduled onto an SM
- Peak efficiency requires multiple thread blocks per processor
Kernel:
- A kernel executes on a GPU (there is also multi-GPU programming)
SM and TB (Thread Block)
Basic architectural relationship
Definition of the SM
- The SM is the basic processing unit of an NVIDIA GPU: a general-purpose processor with a relatively low clock frequency and small caches
- Every GPU contains multiple SMs; for example, the NVIDIA A100 contains 108 SMs
Execution relationship
- A single SM can execute multiple thread blocks at the same time
- Once a thread block has been assigned to an SM, it must finish execution on that SM; it cannot migrate to another SM
Resource allocation
SM hardware resources
- Execution cores (single-precision floating-point units, double-precision floating-point units, etc.)
- Multi-level caches (L1 cache, shared memory, constant cache, texture cache)
- Warp schedulers
- A large number of registers
Scheduling mechanism
- The GPU scheduler assigns thread blocks to available SMs
- As soon as an SM finishes executing a thread block, it moves on to the next one
- To fully utilize the GPU, the number of thread blocks launched should typically be at least 4x the number of SMs (a sketch of this heuristic follows below)
Performance considerations
Parallel execution
- All threads of a thread block execute concurrently once the block has been assigned to an SM
- The SM manages and executes threads in units of warps (groups of 32 threads)
- The SM releases a block's resources only after every thread in the block has finished
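A minimal sketch of applying that 4x heuristic by querying the SM count through the CUDA runtime:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // properties of device 0
    int numBlocks = 4 * prop.multiProcessorCount;  // at least 4 blocks per SM
    printf("%d SMs -> launch at least %d blocks\n",
           prop.multiProcessorCount, numBlocks);
    return 0;
}
```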