
Lecture 7 Introduction to Graphics Processing Units (GPUs)

Overview of CPU and GPU

  • CPU goal: minimum latency
  • GPU goal: maximum throughput


Deal with Stalls

As is well known, stalls are the enemy of performance.

  1. Stall: a core cannot run the next instruction because it depends on a previous one that has not yet finished
  2. Memory operations take 100s-1000s of cycles
  3. GPUs remove the fancy caches and prefetch logic that CPUs use to avoid stalls

So the GPU uses a huge number of threads to hide these stalls and keep its execution units busy.


GPUs in Practice

We need to pay attention to the SM (Streaming Multiprocessor).


Programming GPUs

Heterogeneous Programming (CPU + GPU)


Scenario: only the CPU does the work


Running GPU Code (Kernel)

You can think of the CPU as a boss and the GPU as a high-throughput worker.

So to write CPU + GPU code, we take the CPU's perspective; the process is:

  1. Allocate memory on GPU
  2. Copy data to GPU
  3. Execute GPU program
  4. Wait to complete
  5. Copy results back to CPU

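A minimal host-side sketch of these five steps (assuming the add kernel shown later, host arrays x and y of length N, and device pointers named d_x and d_y as in the lecture):

int N = 1 << 20;
size_t bytes = N * sizeof(float);

// 1. Allocate memory on the GPU
float *d_x, *d_y;
cudaMalloc(&d_x, bytes);
cudaMalloc(&d_y, bytes);

// 2. Copy the input data from the CPU to the GPU
cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

// 3. Execute the GPU program (kernel launch)
add<<<1, 1>>>(N, d_x, d_y);

// 4. Wait for the kernel to complete
cudaDeviceSynchronize();

// 5. Copy the results back to the CPU
cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);

cudaFree(d_x);
cudaFree(d_y);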

add<<<1,1>>>(N, d_x, d_y);

You may be confused by this <<<1,1>>>.

  1. The first 1 means the number of blocks
  2. The second 1 means the number of threads per block

We will return to it shortly.

How to Write GPU Code

// Run kernel on GPU
add<<<1,256>>>(N, d_x, d_y);

The number of threads per block (256 here) must be <= 1024 on Volta (earlier architectures allowed at most 512).

// GPU function to add two vectors 
__global__ 
void add(int n, float *x, float *y) 
{ 
    int index = threadIdx.x;
    int stride = blockDim.x; // number of threads in this block
    for (int i = index; i < n; i+=stride)
        y[i] = x[i] + y[i]; 
    // Works for arbitrary N and # threads / block, only one block
}

GPU Programming Model

Grids and Thread Blocks

Grid > Thread Block > Thread

1D grid of 1D blocks


2D grid of 1D blocks


2D grid of 2D blocks

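As an illustration (a sketch, not from the lecture; the kernel name scale2d, the pointer d_img, and the dimensions width and height are hypothetical), a 2D grid of 2D blocks is launched with dim3 parameters, and each thread recovers its 2D coordinates from blockIdx, blockDim, and threadIdx:

__global__ void scale2d(float *img, int width, int height)
{
    // 2D coordinates of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] *= 2.0f;   // per-element work
}

// Host side: 16x16 threads per block, enough blocks to cover the image
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
scale2d<<<grid, block>>>(d_img, width, height);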

Take-Away

  1. Grid is a collection of threads.
  2. Threads in a grid execute a kernel function and are divided into thread blocks.
  3. gridDim gives the dimensions of the grid, i.e. how many blocks were launched in each dimension, as specified at kernel launch (see the sketch below).
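Building on the earlier single-block add kernel, a common pattern (a sketch, not taken verbatim from the lecture) uses blockIdx and gridDim so the same kernel works with any number of blocks:

__global__
void add(int n, float *x, float *y)
{
    // Global index of this thread across all blocks
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of threads in the whole grid
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Launch with enough blocks to cover N elements, 256 threads per block
int numBlocks = (N + 255) / 256;
add<<<numBlocks, 256>>>(N, d_x, d_y);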


Sharing and Synchronization

Shared Memory on Nvidia GPUs

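Each thread block has a small, fast on-chip shared memory region that all of its threads can read and write; it is declared with __shared__. A minimal sketch (illustrative only; the kernel name reverse_blocks is hypothetical, the block size is assumed to be 256, and the array length a multiple of it):

__global__ void reverse_blocks(const float *in, float *out)
{
    __shared__ float tile[256];                  // one tile per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                   // each thread stages one element
    __syncthreads();                             // wait until the whole tile is loaded
    out[i] = tile[blockDim.x - 1 - threadIdx.x]; // reverse within the block
}

The __syncthreads() barrier used here is the topic of the next subsection.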

Thread Synchronization

Synchronizes all threads within a block

void __syncthreads();

  • Used to prevent RAW / WAR / WAW hazards
  • All threads in the block must reach the barrier
  • If used inside a conditional, the condition must be uniform across the block

Review

  • RAW: Read After Write
  • WAR: Write After Read
  • WAW: Write After Write

__global__ void stencil_1d(int *in, int *out) {
    // Shared tile: BLOCK_SIZE elements plus a RADIUS halo on each side
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements (and the halos) into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    // Store the result
    out[gindex] = result;
}

Threads within a block may synchronize with barriers

// Step 1
__syncthreads();
// Step 2

Implicit barrier between kernels

vec_minus<<<nblocks, blksize>>>(a, b, c); 
// ------------implicit barrier---------------
vec_dot<<<nblocks, blksize>>>(c, c);

CUDA

Extensions to C/C++

  • Declaration specifiers to indicate where things live
    __global__ void KernelFunc(...); // kernel callable from host
    __device__ void DeviceFunc(...); // function callable on device 
    __device__ int GlobalVar;  // variable in device memory
    __shared__ int SharedVar; // in per-block shared memory
    
  • Extend function invocation syntax for parallel kernel launch
    KernelFunc<<<500, 128>>>(...); // 500 blocks, 128 threads each
    
  • Special variables for thread identification in kernels
    dim3 threadIdx;
    dim3 blockIdx;
    dim3 blockDim;
    
  • Intrinsics that expose specific operations in kernel code
    __syncthreads(); // barrier synchronization
    

Features available on GPU

  • Double and single precision
  • Standard mathematical functions – sinf, powf, atanf, ceil, min, sqrtf, etc.
  • Atomic memory operations – atomicAdd, atomicMin, atomicAnd, atomicCAS, etc.
  • These work on both global and shared memory (see the sketch below)
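For example, a minimal sketch (illustrative, not from the lecture; the kernel name count_positive and its arguments are hypothetical) of using atomicAdd so that many threads can safely increment one counter in global memory:

__global__ void count_positive(const float *x, int n, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] > 0.0f)
        atomicAdd(count, 1);   // safe read-modify-write on global memory
}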

Runtime support

  • Explicit memory allocation returns pointers to GPU memory – cudaMalloc(), cudaFree(), cudaMallocManaged()
  • Explicit memory copy for host ↔ device and device ↔ device – cudaMemcpy(), cudaMemcpy2D(), ...
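As a sketch (assuming the same add kernel and vector length N as before), cudaMallocManaged() returns memory that both the CPU and the GPU can access, so the explicit copies from the earlier five-step recipe disappear:

float *x, *y;
cudaMallocManaged(&x, N * sizeof(float)); // visible to both host and device
cudaMallocManaged(&y, N * sizeof(float));

// ... initialize x and y on the CPU ...

add<<<1, 256>>>(N, x, y);    // the kernel uses the same pointers
cudaDeviceSynchronize();     // wait before the CPU reads the results

cudaFree(x);
cudaFree(y);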

Summary

Here we organize the whole GPU execution hierarchy.

Threads:

  • Each thread is a SIMD lane (ALU)

Warps:

  • A warp is executed as a logical SIMD instruction (sort of)
  • Warp width is 32 elements: the LOGICAL SIMD width
  • (Warp-level programming is also possible)

Thread blocks:

  • Each thread block is scheduled onto an SM
  • Peak efficiency requires multiple thread blocks per processor

Kernel:

  • A kernel executes on a GPU (there is also multi-GPU programming)

SM and Thread Block (TB)

Basic Architectural Relationship

Definition of the SM

  • The SM is the basic processing unit of an NVIDIA GPU: a general-purpose processor with a relatively low clock frequency and small caches
  • Each GPU contains multiple SMs; for example, the NVIDIA A100 GPU has 108 SMs

Execution Relationship

  • A single SM can execute multiple thread blocks at the same time
  • Once a thread block has been assigned to an SM, it must finish execution on that SM and cannot migrate to another SM

Resource Allocation

Hardware Resources of an SM

  • Execution cores (single-precision floating-point units, double-precision floating-point units, etc.)
  • Multi-level caches (L1 cache, shared memory, constant cache, texture cache)
  • Warp schedulers
  • A large number of registers

Scheduling Mechanism

  • The GPU scheduler assigns thread blocks to the available SMs
  • As soon as an SM finishes executing one thread block, it moves on to the next
  • To fully utilize the GPU, the number of thread blocks launched should typically be at least 4x the number of SMs (see the sketch below)
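A minimal sketch of sizing a launch by this rule of thumb (illustrative; it reuses the grid-stride add kernel shown earlier, which tolerates extra threads, and the 4x factor comes from the guideline above):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);            // properties of GPU 0

int threadsPerBlock = 256;
int blocksForData = (N + threadsPerBlock - 1) / threadsPerBlock;
int minBlocks = 4 * prop.multiProcessorCount; // at least 4 blocks per SM
int numBlocks = (blocksForData > minBlocks) ? blocksForData : minBlocks;

add<<<numBlocks, threadsPerBlock>>>(N, d_x, d_y);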

Performance Considerations

Parallel Execution

  • All threads in a thread block execute concurrently once the block has been assigned to an SM
  • The SM manages and executes threads in warps (groups of 32 threads)
  • The SM releases a block's resources only after all threads in that block have finished