A Modern Multi-Core Processor¶

Part 1¶

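This lecture walks systematically through the three most basic forms of "parallel design", probably to set up topics such as hyper-threading in the next lecture. Either way, it is extremely important.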

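The three designs are introduced in this order:

Superscalar processor: exploits instruction-level parallelism (ILP) within one instruction stream
SIMD: single instruction, multiple data
Multi-core: multiple cores on one processor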

Tip

Note that these three designs are not mutually exclusive. They are orthogonal to one another, so real systems combine all three.

The lecture also reviews caches and conditional execution (branch prediction), but I am already familiar with those topics, so I skip them here.

Baseline: A very simple processor¶

executes one instruction per clock

Summary of characteristics:

| `instruction flow` num | `fetch/decode` num | `exec unit` num | core |
|---|---|---|---|
| 1 | 1 | 1 | on one core |

Superscalar Processor¶

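(1) Hardware changes:

Multiple fetch/decode units
Multiple exec units

Effect: if there is enough ILP (that is, some instructions do not depend on each other), the hardware automatically fetches and decodes several independent instructions per clock and executes them in parallel.

(2) Note: there is still only one input instruction stream.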

(3) Summary of characteristics:

| `instruction flow` num | `fetch/decode` num | `exec unit` num | core | program vs. baseline |
|---|---|---|---|---|
| 1 | multiple | multiple | on one core | no difference |

Warning

Note that this parallelism is only potential. It is not guaranteed, because it only kicks in when the instruction stream has enough ILP.
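To make "enough ILP" concrete, here is a minimal sketch. The function names (`sum_dependent`, `sum_independent`, `same_result`) are hypothetical and not from the lecture. Both functions compute the same sum from a single instruction stream. The first is one long dependency chain, while the second has four independent chains that a superscalar core may be able to overlap. Whether a particular CPU or compiler actually does so is an assumption, not a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Every add depends on the previous one: a single dependency chain,
// so extra execution units have nothing independent to work on.
std::uint64_t sum_dependent(const std::vector<std::uint32_t>& v) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        s += v[i];
    }
    return s;
}

// Four accumulators form four independent chains. The adds inside one
// loop iteration do not depend on each other, so a superscalar core
// may issue several of them in the same clock.
std::uint64_t sum_independent(const std::vector<std::uint32_t>& v) {
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) {  // leftover elements
        s0 += v[i];
    }
    return s0 + s1 + s2 + s3;
}

// The program's result is unchanged; only the available ILP differs.
bool same_result(const std::vector<std::uint32_t>& v) {
    return sum_dependent(v) == sum_independent(v);
}
```

From the programmer's point of view nothing about the program model changes here, which matches the "no difference" entry in the table: the hardware finds the parallelism on its own.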

SIMD¶

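(1) Hardware changes:

Multiple exec units (ALUs)

Effect: when the high-level code explicitly asks for it (for example through vector intrinsics), the work is executed in vectorized form, which gives parallelism across data elements.

(2) Note: there is still only one input instruction stream. The same instruction is broadcast to all ALUs at once.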

(3) Summary of characteristics:

| `instruction flow` num | `fetch/decode` num | `exec unit` num | core | program vs. baseline |
|---|---|---|---|---|
| 1 | 1 | multiple | on one core | vectorized |

(4) Concrete code changes (showing vectorization):

alt text
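Since the original figure is not reproduced here, the following is a rough, portable sketch of what vectorizing a loop looks like. It is not the code from the lecture. `Vec`, `kLanes`, and the `v*` helpers are hypothetical stand-ins that emulate an 8-wide vector register with a plain array; real code would typically use the target's intrinsics (AVX on x86, for example) instead.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 8;       // assumed vector width
using Vec = std::array<float, kLanes>;  // stand-in for one vector register

// Each helper models ONE vector instruction applied to all lanes at once.
Vec vload(const float* p) {
    Vec r{};
    for (std::size_t l = 0; l < kLanes; ++l) r[l] = p[l];
    return r;
}
Vec vbroadcast(float a) {
    Vec r{};
    for (std::size_t l = 0; l < kLanes; ++l) r[l] = a;
    return r;
}
Vec vmul(const Vec& a, const Vec& b) {
    Vec r{};
    for (std::size_t l = 0; l < kLanes; ++l) r[l] = a[l] * b[l];
    return r;
}
Vec vadd(const Vec& a, const Vec& b) {
    Vec r{};
    for (std::size_t l = 0; l < kLanes; ++l) r[l] = a[l] + b[l];
    return r;
}
void vstore(float* p, const Vec& v) {
    for (std::size_t l = 0; l < kLanes; ++l) p[l] = v[l];
}

// Scalar baseline: one element per loop iteration.
void saxpy_scalar(float a, const std::vector<float>& x,
                  const std::vector<float>& y, std::vector<float>& out) {
    assert(x.size() == y.size() && x.size() == out.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i] + y[i];
}

// Vectorized form: kLanes elements per loop iteration, same instruction
// sequence for every lane, plus a scalar loop for the leftover tail.
void saxpy_simd_style(float a, const std::vector<float>& x,
                      const std::vector<float>& y, std::vector<float>& out) {
    assert(x.size() == y.size() && x.size() == out.size());
    const Vec va = vbroadcast(a);
    std::size_t i = 0;
    for (; i + kLanes <= x.size(); i += kLanes) {
        const Vec vx = vload(&x[i]);
        const Vec vy = vload(&y[i]);
        vstore(&out[i], vadd(vmul(va, vx), vy));
    }
    for (; i < x.size(); ++i) out[i] = a * x[i] + y[i];
}
```

The point of the sketch is the shape of the change: the loop stride becomes the vector width, and every operation in the body is expressed as one instruction over all lanes.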

Multi-core¶

Motivation:

Rather than use transistors to increase the sophistication of processor logic that accelerates a single instruction stream (e.g., out-of-order and speculative operations in a superscalar processor), use the increasing transistor count to add more cores to the processor.

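(1) Hardware changes:

Zoom out: the previous two designs are mechanisms inside a single core. Here we look at what multiple cores can do together.

(2) Note: each individual core on the processor still takes a single input instruction stream.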

(3) Summary of characteristics:

| `instruction flow` num | `fetch/decode` num | `exec unit` num | core | program vs. baseline |
|---|---|---|---|---|
| 1 per core | each core has one instruction flow | TBD | multiple cores | uses C++ threads |

(4) Concrete code changes (showing the split into threads):

alt text
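Since the original figure is not reproduced here, below is a minimal sketch of the kind of change involved, using `std::thread`. It is not the lecture's code. `square_range` and `square_two_threads` are hypothetical names, and the fixed split into two halves is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Work done by one instruction stream: square the elements in [begin, end).
void square_range(const std::vector<int>& in, std::vector<int>& out,
                  std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        out[i] = in[i] * in[i];
    }
}

// Split the array in two. A second thread (which the OS may schedule on
// another core) handles the first half, the calling thread the second.
// The two ranges are disjoint, so no synchronization beyond join() is needed.
void square_two_threads(const std::vector<int>& in, std::vector<int>& out) {
    assert(in.size() == out.size());
    const std::size_t mid = in.size() / 2;
    std::thread worker([&in, &out, mid] { square_range(in, out, 0, mid); });
    square_range(in, out, mid, in.size());
    worker.join();  // wait for the other instruction stream to finish
}
```

Unlike the superscalar case, the parallelism here is visible in the program: the programmer creates the second instruction stream explicitly.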

TL; DR¶

Comparing the three designs¶

alt text

Misc¶

Hiding stalls with multi-threading

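Core idea: the current thread is going to be stalled for a long time anyway, so the core might as well switch to another thread and do useful work in the meantime.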

alt text
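A back-of-the-envelope model makes the idea quantitative. This is a sketch under loud assumptions that are mine, not the lecture's: all threads are identical, each alternates between `work_cycles` of computation and `stall_cycles` of waiting, switching between threads is free, and the core runs them round-robin. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of cycles in which the core does useful work, in the idealized
// model described above. One thread is busy work/(work + stall) of the
// time; T interleaved threads can fill up to T times that, capped at 1.
double core_utilization(int threads, int work_cycles, int stall_cycles) {
    assert(threads > 0 && work_cycles > 0 && stall_cycles >= 0);
    const double busy = static_cast<double>(threads) * work_cycles;
    const double period = static_cast<double>(work_cycles) + stall_cycles;
    return std::min(1.0, busy / period);
}

// Smallest thread count that fully hides the stall in the same model:
// while one thread waits, the others must supply at least stall_cycles
// of work, so T = 1 + ceil(stall / work).
int threads_to_hide_stall(int work_cycles, int stall_cycles) {
    assert(work_cycles > 0 && stall_cycles >= 0);
    return 1 + (stall_cycles + work_cycles - 1) / work_cycles;
}
```

For example, with 1 cycle of work followed by a 3-cycle stall, a single thread keeps the core busy 25% of the time, and 4 threads are enough to keep it fully busy in this model.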

This comes with some costs:

alt text

Coherent execution vs. divergent execution

Coherent execution:
1. Definition: the same instruction sequence applies to many data elements.
2. It is essential for SIMD efficiency, because every lane of a vector instruction has to execute the same operation.
3. It is not required for multi-core parallelism, because each core fetches and decodes its own instruction stream, so the streams are free to differ.
Divergent execution:
1. Definition: a lack of instruction stream coherence.
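The effect of divergence on SIMD can be sketched with a masked branch. This is an emulation with plain arrays, not real vector code. `Vec`, `Mask`, `kLanes`, and both function names are hypothetical, and the "skip a side of the branch when no lane needs it" behavior is an assumption about how the hardware or compiler handles the mask.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // assumed vector width
using Vec = std::array<float, kLanes>;
using Mask = std::array<bool, kLanes>;

// Scalar reference: each element takes exactly one side of the branch.
float branchy_scalar(float x) {
    if (x > 0.0f) return x * 2.0f;
    return -x;
}

// SIMD-style: all lanes share one instruction stream, so each side of the
// branch is a separate pass over the whole vector, with masked-off lanes
// sitting idle. Returns the number of passes issued (1 = coherent,
// 2 = divergent), so lane utilization is 1.0 / passes.
std::size_t branchy_simd_style(const Vec& x, Vec& out) {
    Mask taken{};
    bool any_then = false;
    bool any_else = false;
    for (std::size_t l = 0; l < kLanes; ++l) {  // one vector compare
        taken[l] = x[l] > 0.0f;
        any_then = any_then || taken[l];
        any_else = any_else || !taken[l];
    }
    std::size_t passes = 0;
    if (any_then) {  // "then" side, only lanes with the mask set do work
        ++passes;
        for (std::size_t l = 0; l < kLanes; ++l) {
            if (taken[l]) out[l] = x[l] * 2.0f;
        }
    }
    if (any_else) {  // "else" side, only lanes with the mask clear do work
        ++passes;
        for (std::size_t l = 0; l < kLanes; ++l) {
            if (!taken[l]) out[l] = -x[l];
        }
    }
    return passes;
}
```

When every lane takes the same side, one pass is enough and all lanes do useful work. When lanes disagree, both passes run and each lane is idle during one of them, which is why coherence matters for SIMD but not for separate cores.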