
Lecture 2 Memory Hierarchies and Matrix Multiplication

Current situation

  1. Most applications run at < 10% of the "peak" performance of the system
  2. Much of that performance is lost on a single processor: time goes into moving data rather than computing

Processor Analysis

Idealized Uniprocessor Model

  • Processor names variables:
    • int / float / pointers, arrays, structures, etc.
  • Processor performs operations on those variables:
    • Arithmetic, logical operations, etc.
  • Processor controls the order:
    • Branches (if), loops, function calls, etc.

The idealized cost model:

– Each operation (+, *, etc.) has roughly the same cost
– Reading and writing variables is "free"

Compiler Optimization

Compilers and assembly code

– Check that the program is legal
– Translate it into assembly code
– Optimize the generated code


In particular, it is worth looking at how the compiler optimizes for "using the fewest registers":

  • Each node in the figure represents a variable (a, b, c, d, e, f)
  • If the live ranges of two variables overlap (they are "live" at the same time), an edge connects them
  • The colors in the figure (R1, R2, R3, R4) represent the registers assigned to the variables
  • Connected nodes cannot use the same register, because they are live at the same time

This graph-coloring formulation was proposed around 1980 for register allocation. With it, the compiler can minimize the number of registers needed and improve execution efficiency.

Supplementary

Here is a blog post on compiler register allocation that explains it very clearly; highly recommended.

Pay particular attention to the following points:

  1. If the uses of two variables do not conflict, they can be assigned to the same register.
  2. In a basic block of linear IR, a variable is live from its first definition to its final use.
  3. Variables whose live ranges overlap cannot share a register, because they must exist independently; conversely, two variables with non-overlapping live ranges can be assigned to the same register.

Register allocation thus becomes an instance of the graph-coloring problem: assign each node of the graph a color (register) such that no two adjacent nodes have the same color.

Besides register allocation, the compiler performs optimizations:

– Unrolls loops (because control isn't free): e.g., iterations 1..10 become [1 2 3 4 5] + [6 7 8 9 10]
– Fuses loops (merges two loops into one)
– Interchanges loops (reorders loop nesting)
– Eliminates dead code (e.g., a branch that is never taken)
– Reorders instructions to improve register reuse, and more
– Strength reduction (e.g., shift left rather than multiply by 2)

Realistic Uniprocessor Model

Time Calculation


Here we can see the memory hierarchy.

  • Latency: the time until the data arrives (propagation delay)
  • Bandwidth: how much data can be transferred per unit time (transmission delay)

Time to move n consecutive words: \(\alpha + n\beta\), where \(\alpha\) is the latency and \(\beta\) is the time per word (the inverse of bandwidth)

Memory Hierarchy


Cache

Cache is fast (expensive) memory that keeps copies of data from main memory; it is hidden from software.

  1. Cache line length: # of bytes loaded together in one entry
  2. Associativity
    • direct-mapped: only 1 address (line) from a given range can be in the cache
      • Data at address xxxxx1101 is stored at cache location 1101
      • Only 1 such value from memory can be cached at a time
    • n-way associative: n >= 2 lines with the same index can be stored
      • Up to n words with addresses xxxxx1101 can be stored at cache location 1101 (each location holds a short list of lines)

Approaches to Handling Memory Latency

1) Reuse values in fast memory (bandwidth filtering)

– need temporal locality in program

2) Move larger chunks (achieve higher bandwidth)

– need spatial locality in program

3) Issue multiple reads/writes in single instruction (higher bw)

– vector operations access a set of locations (typically neighboring ones)

4) Issue multiple reads/writes in parallel (hide latency)

– prefetching issues a read hint
– delayed writes (write buffering) stage writes for a later operation
– both require that nothing dependent is happening in the meantime (parallelism)

3) and 4) are about concurrency

Little's Law

How much concurrency does it take to overlap the latency?

To calculate the needed concurrency, use Little's Law (from queuing theory):

\(concurrency = latency \times bandwidth\)


Little's Law is a simple but powerful formula:

\(L = \lambda \times W\)

where:

  • L: the average number of items in the system = concurrency
  • λ: the average arrival rate per unit time = bandwidth
  • W: the average time each item spends in the system = latency

To illustrate with a coffee shop example:

Suppose a coffee shop has:

  • 40 customers arriving per hour (λ = 40 customers/hour)
  • each customer spends 6 minutes on average from entering to leaving (W = 0.1 hours)

Applying Little's Law: L = 40 × 0.1 = 4

This means there are, on average, 4 customers in the shop at any moment.

Let's see intuitively why the average is 4 at every moment:

  1. 40 customers per hour means a new customer arrives every 1.5 minutes on average (60 minutes / 40 customers)
  2. Each customer stays 6 minutes, and during those 6 minutes:
    • new customers keep arriving
    • earlier customers keep leaving
    • a dynamic equilibrium forms
  3. At any given moment:
    • some customers have just walked in
    • some are ordering
    • some are waiting for their coffee
    • some are about to leave

Why the average comes out to 4:

  • a customer stays for 6 minutes
  • a new customer arrives every 1.5 minutes
  • so 4 customers overlap in the shop during those 6 minutes (6 minutes ÷ 1.5 minutes/customer = 4 customers)

Pipelining


  • Bandwidth: loads per hour
    • Bandwidth limited by slowest pipeline stage
  • Speedup <= # of stages

SIMD

Need to:

  • Expose parallelism to compiler (or write manually)
  • Be contiguous in memory and cache aligned (in most cases)


Data Dependencies Limit Parallelism

Obviously, parallelism can get the wrong answer if instructions execute out of order.

Types of dependencies
  • RAW: Read-After-Write
    • X = A ; B = X;
  • WAR: Write-After-Read
    • A = X ; X = B;
  • WAW: Write-After-Write
    • X = A ; X = B;
  • No problem / dependence for RAR: Read-After-Read

The hardware and the compiler cannot break these dependencies: they must preserve them when reordering instructions.

FMA: Fused Multiply Add


  • Multiply followed by add is very common in programs
    • x = y + c * z
  • FMA is a single instruction that does both multiply and add
    • Same rate as + or * alone
    • with a single rounding step
Why is a fused multiply-add no slower than a single add or multiply?

  1. Pipeline optimization
    • FMA is implemented in hardware as a single atomic operation
    • no intermediate result needs to be stored between the multiply and the add
    • the whole operation completes in one pass through the pipeline
  2. Dedicated circuit design
    • the FPU (floating-point unit) in modern processors is built around FMA
    • the multiplier is built from combinational logic and feeds directly into the adder
    • the intermediate rounding and normalization steps are avoided
  3. Hardware resource sharing
    • addition and multiplication use the same floating-point multiply-add subunit
    • in AMD's "Piledriver" processors, all of these operations share the same hardware resources

Through this hardware-level design, FMA completes a multiply-add as a single operation, making its throughput comparable to a standalone add or multiply.

Take Away
