Lecture 3 Matrix Multiplication and the Roofline Model¶
Matrix Multiplication Analysis¶
Alternate Data Layouts¶

The "Z" (Morton-order) layout is appealing because elements that are neighbors in the matrix tend to be stored near each other in memory.
Its main drawback is that computing an element's index is harder.
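To see why indexing is the awkward part, here is a minimal sketch of one common way to compute a Z-order offset: interleave the bits of the row and column. The function names `morton_index` and `morton_layout` are made up for illustration, and the choice of putting row bits in the odd positions is an assumption; the lecture does not fix a convention.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a Z-order (Morton) offset.

    Column bits go to even positions and row bits to odd positions,
    so the four cells of every aligned 2x2 block get consecutive offsets.
    """
    index = 0
    for k in range(bits):
        index |= ((col >> k) & 1) << (2 * k)      # column bit -> even position
        index |= ((row >> k) & 1) << (2 * k + 1)  # row bit -> odd position
    return index


def morton_layout(n):
    """Return an n x n table whose (i, j) entry is the Z-order offset of A[i][j]."""
    return [[morton_index(i, j) for j in range(n)] for i in range(n)]


for table_row in morton_layout(4):
    print(table_row)
```

Compared with row-major indexing (`i * n + j`), each lookup now costs a loop over bits, which is the price paid for the better locality.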
Theory for All Layouts¶
Theory: Communication lower bound
Theorem (Hong & Kung, 1981):
Any reorganization of matmul (using only commutativity and associativity) has computational intensity \(q = O\left(\sqrt{M_{\text{fast}}}\right)\), so #words moved between fast/slow memory = \(\Omega \left( \frac{n^3}{\sqrt{M_{\text{fast}}}} \right)\)
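To make the bound concrete, the sketch below counts flops and slow-memory words for ordinary \(b \times b\) blocked matmul under a simple cost model. The model is an assumption of this sketch, not part of the theorem: three blocks must fit in fast memory (\(3b^2 \le M_{\text{fast}}\)), each block multiplication loads one block of A and one of B, and each block of C is read and written once. The name `blocked_matmul_traffic` is hypothetical.

```python
import math


def blocked_matmul_traffic(n, m_fast):
    """Count flops and slow-memory words for b x b blocked matmul.

    Assumes b = floor(sqrt(m_fast / 3)) so that one block each of
    A, B and C fits in fast memory, and that n is a multiple of b.
    """
    b = math.isqrt(m_fast // 3)
    if b == 0 or n % b != 0:
        raise ValueError("need b >= 1 and n divisible by b")
    blocks_per_dim = n // b
    flops = 2 * n**3
    # Each of the blocks_per_dim^3 block multiplications loads a block of A and a block of B.
    words_ab = 2 * blocks_per_dim**3 * b * b
    # Each block of C is read once and written once.
    words_c = 2 * n * n
    words = words_ab + words_c
    return {
        "b": b,
        "flops": flops,
        "words": words,
        "q": flops / words,                        # computational intensity
        "bound_scale": n**3 / math.sqrt(m_fast),   # n^3 / sqrt(M_fast), constant ignored
    }


result = blocked_matmul_traffic(n=1024, m_fast=768)
print(result)
```

In this model \(q = \frac{bn}{n+b} \approx b \approx \sqrt{M_{\text{fast}}/3}\), so blocked matmul already matches the \(O(\sqrt{M_{\text{fast}}})\) intensity up to a constant factor.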

Later work has continued to refine this result, but we do not cover it here.
Roofline Model¶
How fast can an algorithm go in practice?
Definition¶
Three pieces: two describe the machine and one describes the application.
- Arithmetic performance (flops/sec)
    - Clock speed and parallelism (ILP, SIMD, multicore) are the focus of this part.
- Memory bandwidth (bytes/sec)
    - Reducing latency is the focus of this part.
- Computational (arithmetic) intensity
    - Application balance (between computation and communication)
 
Some definitions:
- Machine Balance: \(\frac{\text{(Peak DP FLOP/s)}}{\text{(Memory Bandwidth)}}\)
- CI/AI: Computational / Arithmetic Intensity, \(\frac{\text{(DP FLOPs Performed)}}{\text{(Data Moved)}}\)
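As a quick illustration of the two ratios, here is a small sketch. The machine numbers (1 TFLOP/s peak, 100 GB/s bandwidth) are hypothetical, and the traffic count for the `y = a*x + y` kernel assumes no cache reuse; the function names are made up for this example.

```python
def machine_balance(peak_flops_per_s, bandwidth_bytes_per_s):
    """Flops the machine can perform in the time it takes to move one byte."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def arithmetic_intensity(flops, bytes_moved):
    """Flops performed per byte moved between slow and fast memory."""
    return flops / bytes_moved


# Hypothetical machine: 1 TFLOP/s peak DP, 100 GB/s memory bandwidth.
balance = machine_balance(1e12, 100e9)

# y = a*x + y on n doubles: 2n flops; read x, read y, write y = 3 * 8n bytes
# (assuming nothing is reused from cache).
n = 1_000_000
axpy_ai = arithmetic_intensity(2 * n, 3 * 8 * n)

print(f"machine balance: {balance:.1f} flops/byte")
print(f"axpy intensity:  {axpy_ai:.4f} flops/byte")
```

Because the kernel's intensity is far below the machine balance, this kernel cannot keep the arithmetic units busy on such a machine.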

Model¶

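The Roofline model is an intuitive visual performance model for estimating the upper bound on the performance of a compute kernel or application on a given hardware architecture.
- Performance ceilings
    - Compute ceiling: the processor's peak compute performance (FLOP/s)
    - Memory bandwidth ceiling: the system's maximum memory bandwidth
- Arithmetic intensity
    - Defined as the number of floating-point operations per byte of memory traffic
    - Formula: AI = FLOPs / bytes of memory traffic
    - A key measure of a program's computational density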
 
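The Roofline model is drawn on a log-log plot:
- Axes
    - X-axis: arithmetic intensity (FLOPs/byte)
    - Y-axis: performance (FLOP/s)
- Performance bounds
    - Sloped segment: the region limited by memory bandwidth
    - Horizontal segment: the region limited by compute capability
- Ridge point
    - The intersection of the sloped and horizontal segments
    - The point at which the system is balanced between computation and memory access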
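The plot described above amounts to taking the minimum of two bounds. A minimal sketch, using hypothetical machine numbers (1 TFLOP/s peak, 100 GB/s bandwidth) and made-up function names:

```python
def attainable_flops_per_s(ai, peak_flops_per_s, bandwidth_bytes_per_s):
    """Roofline bound: the lower of the compute ceiling and the bandwidth ceiling."""
    return min(peak_flops_per_s, ai * bandwidth_bytes_per_s)


def ridge_point(peak_flops_per_s, bandwidth_bytes_per_s):
    """Arithmetic intensity (flops/byte) where the sloped and flat segments meet."""
    return peak_flops_per_s / bandwidth_bytes_per_s


PEAK = 1e12   # hypothetical peak, FLOP/s
BW = 100e9    # hypothetical memory bandwidth, bytes/s

print(f"ridge point: {ridge_point(PEAK, BW):.1f} flops/byte")
for ai in (0.1, 0.5, 2.0, 10.0, 50.0):
    bound = attainable_flops_per_s(ai, PEAK, BW)
    print(f"AI = {ai:5.1f} flops/byte -> at most {bound / 1e9:7.1f} GFLOP/s")
```

Below the ridge point the bound grows linearly with intensity (the sloped segment); above it the bound stays at the peak (the horizontal segment).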
 
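Use cases:
- Performance bottleneck analysis
    - Determine whether a program is compute-bound or memory-bound
    - Identify the direction for performance optimization
- Optimization guidance
    - For memory-bound programs, optimize memory access patterns
    - For compute-bound programs, optimize computational efficiency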
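The bottleneck analysis above can be sketched as a comparison of each kernel's arithmetic intensity against the machine's ridge point. Everything concrete here is an assumption for illustration: the machine numbers, the kernel names, and their intensities are hypothetical, as is the function name `classify_kernel`.

```python
def classify_kernel(ai, peak_flops_per_s, bandwidth_bytes_per_s):
    """Return (regime, attainable FLOP/s, suggested direction) for one kernel."""
    ridge = peak_flops_per_s / bandwidth_bytes_per_s
    attainable = min(peak_flops_per_s, ai * bandwidth_bytes_per_s)
    if ai < ridge:
        return "memory-bound", attainable, "optimize memory access patterns"
    return "compute-bound", attainable, "optimize computational efficiency"


PEAK = 1e12   # hypothetical peak, FLOP/s
BW = 100e9    # hypothetical memory bandwidth, bytes/s

# Hypothetical kernels with made-up arithmetic intensities (flops/byte).
kernels = {
    "low-intensity kernel": 0.1,
    "mid-intensity kernel": 2.0,
    "high-intensity kernel": 40.0,
}

report = {name: classify_kernel(ai, PEAK, BW) for name, ai in kernels.items()}
for name, (regime, attainable, advice) in report.items():
    print(f"{name}: {regime}, at most {attainable / 1e9:.0f} GFLOP/s, {advice}")
```

The classification only says which ceiling applies; how close a real kernel gets to that ceiling still has to be measured.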