Lecture 1 Introduction & Overview¶
- Using multiple processors in parallel to solve problems more quickly than with a single processor.
- When memory requirements grow too large for a single machine, we need distributed memory.
Parallel Computer Model¶
- SMP: Shared Memory Multiprocessor
    - a.k.a. multicore
    - connects multiple processors to a single memory system
    - every processor can directly access the shared memory
    - illustration: several processors linked together, all communicating directly through the shared memory
- multiple processors (cores) on a single chip
- HPC: High Performance Computing
    - a.k.a. Distributed Memory (DM) / cluster
    - each processor has its own memory; processors are connected by a network
    - a distributed system containing many processors (nodes)
- SIMD Computer: Single Instruction Multiple Data
- essentially, a huge vector computer
- multiple processors (or functional units) that perform the same operation on multiple data elements at once
- most single processors have SIMD units with ~2-8 way parallelism
- Graphics processing units (GPUs) use this
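The SIMD idea can be sketched in plain Python (conceptual only; `simd_add` is an illustrative name, and real SIMD hardware applies the instruction to all lanes in a single cycle rather than looping):

```python
# Conceptual SIMD sketch: one instruction (+) applied to every "lane"
# of data at once. This loop only models the semantics, not the speed.
def simd_add(lanes_a, lanes_b):
    return [a + b for a, b in zip(lanes_a, lanes_b)]

# A 4-way SIMD add: four element-wise additions from one logical operation
print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```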
> Performance = parallelism
> Efficiency = locality
>
> — Bill Dally (NVIDIA and Stanford)
Comparison¶
What's not a Parallel Computer?
Concurrency vs. Parallelism
- Parallelism: multiple tasks are actually active, physically running at the same time.
- Concurrency: multiple tasks are logically active; time-sharing creates an illusion of "running at the same time".
Parallelism vs. Concurrency
Basic concepts
- Concurrency: the ability to handle multiple tasks within the same time period; switching between tasks creates the appearance of "running at the same time"
- Parallelism: the ability to truly execute multiple tasks at the same instant; requires hardware support from multiple execution units (e.g., CPU cores)
Key differences
- Execution
    - concurrency alternates between tasks via time slicing
    - parallelism truly executes multiple tasks at the same moment
- Resource requirements
    - concurrency can be achieved on a single-core processor
    - parallelism requires a multicore processor or a distributed system
- Design level
    - concurrency is a problem-domain concept: how to structure a program to handle multiple events happening at once
    - parallelism is a solution-domain concept: how to use multiple computing resources to improve execution efficiency
Examples
- Concurrency suits (typically within a single system, at the operations level):
    - I/O-bound tasks
    - keeping a program responsive
    - tasks that depend on each other or need coordination
- Parallelism suits (typically where multiple systems cooperate):
    - CPU-bound computation
    - data-parallel processing
    - processing independent tasks simultaneously
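The distinction can be seen in a small Python sketch (assuming CPython; `io_task` and `cpu_task` are illustrative names): threads overlap I/O waits even on one core (concurrency), while separate processes run CPU work on multiple cores at the same instant (parallelism).

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_task(n):
    # I/O-bound work (simulated): while one thread sleeps, others run
    time.sleep(0.1)
    return n

def cpu_task(n):
    # CPU-bound work: sum of squares 0..n-1
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Concurrency: 4 threads interleave their waits, so total time is
    # roughly one sleep (~0.1 s), not four sleeps (~0.4 s)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(io_task, range(4)))
    print(results, f"{time.perf_counter() - start:.2f}s")

    # Parallelism: processes run CPU work simultaneously on separate cores
    with ProcessPoolExecutor(max_workers=2) as ex:
        sums = list(ex.map(cpu_task, [100, 1000]))
    print(sums)
```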
Parallel Computer vs. Distributed System
- A distributed system is inherently distributed, i.e., serving clients at different locations
- A parallel computer may use distributed memory (multiple processors with their own memory) for more performance
- review parallel computer model: SMP / HPC / SIMD
- With the HPC model, a parallel computer looks more similar to a distributed system.
Units of Measure for HPC
High Performance Computing (HPC) units are:
- Flop: floating point operation, usually double precision unless noted
- Flop/s: floating point operations per second
- Bytes: size of data (a double-precision floating point number is 8 bytes)
We usually use Flop/s as the standard measure of HPC performance.
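A minimal sketch of how a Flop/s figure is obtained: count the floating point operations a kernel performs and divide by the elapsed time. A dot product of length n does n multiplies and n adds, i.e. 2n Flops (pure Python is far below hardware peak, so the printed number is only illustrative).

```python
import time

def dot(a, b):
    # n multiplies + n adds = 2n floating point operations
    return sum(x * y for x, y in zip(a, b))

n = 1_000_000
a = [1.0] * n
b = [2.0] * n

start = time.perf_counter()
result = dot(a, b)
elapsed = time.perf_counter() - start

flops = 2 * n
print(f"{flops / elapsed / 1e6:.1f} MFlop/s")
```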
New Trends¶
Trends in HPC:
- Big / Fast Systems are all GPU-accelerated
- Average system age is increasing
    - costs keep rising
    - performance is not improving as quickly, so one generation of machines stays usable for a relatively long time
Current Paradigm of Science:
- Theory
- Experiments
- Simulation
- Data Analysis
- Machine Learning
Since around 2005, all new computers have been parallel computers!
End of Moore's Law¶
Moore's Law: transistor density of semiconductor chips would double roughly every 18 months.
- Before 2000, transistor density and clock frequency increased together.
- Clock frequency has since hit its bottleneck.
- But Moore's law itself (density doubling) seems alive so far.
- It may hold for some more years, but it will inevitably end in the near future!
Reason
1) Effects of shrinking transistor size
When the transistor feature size shrinks by a factor of x, a chain of effects follows:
- Clock frequency rises
    - with shorter wires, the clock frequency can in theory rise by a factor of x
    - (size shrinks by x → signal paths shorten by x → signal time drops → frequency rises by x)
    - in practice the gain is less than x, because power consumption limits the maximum frequency
- Transistor density increases
    - the number of transistors per unit area grows by a factor of x²
    - because transistors shrink by x in each of the two planar dimensions (x per dimension, x² in 2D)
- Die size changes
    - die sizes have actually grown, typically by about a factor of x
    - this growth happened alongside feature-size shrinking; it is correlation, not causation
Die: the chip itself
2) Computing the performance gain
==Hardware== raw capability gain
- Combining the above, the raw computing capability of a chip improves by about x⁴!
- The x⁴ comes from:
    - x² (transistor density)
    - x (clock frequency)
    - x (die size)
Performance gain from ==other optimizations==
- A factor of x³ of that gain typically goes into:
    - instruction-level parallelism (ILP)
    - locality optimizations such as caches
- As a result, some programs gained an x³ speedup without any modification.
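The x⁴ bookkeeping above can be written out as a tiny idealized calculation (`raw_capability_gain` is an illustrative name; real gains are smaller since power limits the frequency term):

```python
# Idealized scaling from the lecture: shrinking feature size by x gives
# x^2 from density, x from frequency, x from die size -> x^4 raw gain.
def raw_capability_gain(x):
    density = x ** 2   # transistors per unit area (x per planar dimension)
    frequency = x      # shorter wires -> higher clock (power-limited in practice)
    die_size = x       # die area also grew ~x (an empirical trend)
    return density * frequency * die_size

print(raw_capability_gain(2))  # 16 = 2^4
```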
3) Hardware-driven gains will hit a ceiling
Speed limit of a serial computer:
- for a 1 Tflop/s sequential machine, a signal can travel only about 0.3 mm in the time of one operation
- this limit is set by the speed of light
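The 0.3 mm figure follows directly from the speed of light: at 1 Tflop/s, one operation takes 1 picosecond, and light travels about c × 10⁻¹² meters in that time.

```python
# Speed-of-light limit for a 1 Tflop/s serial machine
c = 299_792_458.0               # speed of light, m/s
rate = 1e12                     # 1 Tflop/s -> one operation per 1e-12 s
distance_mm = c / rate * 1000   # meters per operation -> millimeters
print(f"{distance_mm:.2f} mm")  # ~0.30 mm
```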
4) To push performance beyond this point, parallel computing architectures are required
This explains why modern processor design increasingly favors multicore and parallel architectures rather than simply chasing higher clock frequencies.