
Lecture 1 Introduction & Overview

  • Using multiple processors in parallel to solve problems more quickly than with a single processor.
  • When memory requirements become large, we need distributed memory.

Parallel Computer Model


  1. SMP: Shared Memory Multiprocessor
    • a.k.a. multicore
    • multiple processors connected to a single memory system
    • every processor can directly access the shared memory
    • illustration: the processors are linked together and communicate directly through the shared memory
    • multiple processors (cores) on a single chip
  2. HPC: High Performance Computing
    • a.k.a. Distributed Memory (DM) / cluster
    • each processor has its own memory; processors are connected by a network
    • distributed systems containing many processors (nodes)
  3. SIMD Computer: Single Instruction Multiple Data
    • essentially a huge vector computer
    • multiple processors (or functional units) perform the same operation on multiple data elements at once
    • most single processors have SIMD units with ~2-8 way parallelism
    • Graphics processing units (GPUs) use this model
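The SIMD idea, one instruction applied to every element of a vector in lockstep, can be sketched in plain Python. This is only a conceptual model of the semantics; real SIMD units do this in a single hardware instruction.

```python
# Conceptual model of SIMD execution: one operation is applied to
# all lanes of a fixed-width "vector register" at once.
# (Pure-Python sketch of the semantics, not actual hardware SIMD.)

def simd_add(lane_a, lane_b):
    """Apply one ADD operation across all lanes in lockstep."""
    assert len(lane_a) == len(lane_b), "SIMD lanes must have equal width"
    return [a + b for a, b in zip(lane_a, lane_b)]

# A 4-wide vector unit handling 4 doubles per "instruction"
va = [1.0, 2.0, 3.0, 4.0]
vb = [10.0, 20.0, 30.0, 40.0]
print(simd_add(va, vb))  # one conceptual instruction, four results
```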
> 1. Performance = parallelism
> 2. Efficiency = locality
>
> Bill Dally (NVIDIA and Stanford)

Comparison

What's not a Parallel Computer?

Concurrency vs. Parallelism


  1. Parallelism: multiple tasks are physically active, actually running at the same time.
  2. Concurrency: multiple tasks are logically active; time-sharing creates an illusion of "running at the same time".
Parallelism vs. Concurrency

Basic Concepts

  1. Concurrency: the ability to handle multiple tasks within the same time period, switching between tasks to create the appearance of simultaneous progress
  2. Parallelism: the ability to truly execute multiple tasks at the same instant, which requires hardware support from multiple execution units (e.g., CPU cores)

Key Differences

  1. Execution model

    • Concurrency alternates between tasks via time-slicing
    • Parallelism executes multiple tasks truly simultaneously
  2. Resource requirements

    • Concurrency can be achieved on a single-core processor
    • Parallelism requires a multicore processor or a distributed system
  3. Design perspective

    • Concurrency is a problem-domain concept: how to structure a program to handle multiple events happening at once
    • Parallelism is a solution-domain concept: how to use multiple compute resources to improve execution efficiency

Examples

  1. Concurrency suits (typically within a single system, at the operational level):

    • I/O-bound tasks
    • keeping a program responsive
    • scenarios where tasks depend on each other or need coordination
  2. Parallelism suits (typically across multiple cooperating systems):

    • CPU-bound computation
    • data-parallel processing
    • simultaneous processing of independent tasks
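The distinction can be demonstrated with Python's standard library. This is a minimal sketch (worker counts and sleep times are illustrative): threads interleave I/O-bound waits even on one core, which is concurrency; for CPU-bound work you would need processes on multiple cores for true parallelism.

```python
# Concurrency: many I/O-bound tasks interleaved (threads).
# Parallelism: CPU-bound tasks spread over cores (processes).
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(i):
    time.sleep(0.1)          # stands in for a network/disk wait
    return i * i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(io_task, range(4)))
elapsed = time.perf_counter() - start

# The four 0.1 s waits overlap, so the total is ~0.1 s, not ~0.4 s:
# concurrency pays off even on a single core.
print(results, f"{elapsed:.2f}s")
# For CPU-bound work, swap in ProcessPoolExecutor to get true
# parallelism across cores (Python threads are limited by the GIL).
```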

Parallel Computer vs. Distributed System

  1. A distributed system is inherently distributed, i.e., it serves clients at different locations
  2. A parallel computer may use distributed memory (multiple processors, each with its own memory) for more performance
    • review the parallel computer models: SMP / HPC / SIMD
    • If we use an HPC model, it is more similar to a distributed system.

Units of Measure for HPC

High Performance Computing (HPC) units are:

  • Flop: floating point operation, usually double precision unless noted
  • Flop/s: floating point operations per second
  • Bytes: size of data (a double precision floating point number is 8 bytes)

We usually use Flop/s as the standard metric for HPC performance.
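A quick worked example of these units (the problem size and run time are illustrative, not from the lecture):

```python
# Worked example of the HPC units above (numbers are illustrative).
DOUBLE_BYTES = 8              # a double-precision float is 8 bytes

n = 10**9                     # a vector of one billion doubles
memory_bytes = n * DOUBLE_BYTES
print(memory_bytes / 10**9, "GB")     # 8.0 GB of data

flops_done = 2 * n            # e.g. one multiply + one add per element
seconds = 0.5                 # hypothetical run time
flop_per_s = flops_done / seconds
print(flop_per_s / 10**9, "Gflop/s")  # 4.0 Gflop/s sustained
```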

Trends in HPC:

  • Big / fast systems are all GPU-accelerated
  • Average system age is increasing
    • costs are increasing
    • performance is not improving as quickly, so one generation of machines stays in service relatively long

Current Paradigm of Science:

  1. Theory
  2. Experiments
  3. Simulation
  4. Data Analysis
  5. Machine Learning

Since 2005, all computers are Parallel Computers!

End of Moore's Law

Moore's Law: transistor density of semiconductor chips would double roughly every 18 months.

  1. Before 2000, transistor density and frequency were increasing at the same time.
  2. Frequency has since reached its bottleneck.
  3. But the actual Moore's Law (density doubling) seems alive (so far).
  4. It may hold for some years yet, but it will inevitably end in the near future!
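Doubling every 18 months compounds quickly; a back-of-the-envelope calculation of what that rate implies over a decade:

```python
# Transistor density doubling every 18 months (Moore's Law):
months = 10 * 12               # a ten-year span
doublings = months / 18        # ~6.7 doublings in a decade
growth = 2 ** doublings
print(growth)                  # roughly a 100x density increase
```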

Reasons

1) Effects of shrinking transistor size

When the transistor feature size shrinks by a factor of x, a chain of effects follows:

  1. Clock frequency rises

    • Since wires get shorter, the clock frequency can in theory rise by a factor of x
      • (feature size down by x, signal paths shorter by x, so signal time drops and frequency rises by x)
    • In practice the gain is less than x, because power consumption limits the maximum frequency
  2. Transistor density increases

    • The number of transistors per unit area grows by x²
      • because transistors shrink by x in each of the two dimensions (x per dimension, x² in 2D)
  3. Die size changes

    • Die sizes have actually grown
    • This growth happened alongside the feature-size shrinking; it is correlation, not causation
    • Typically the die size also grows by about x

Die: the silicon chip itself

2) Computing the performance gain

Compute gains from ==hardware==

  • Combining the above, the raw compute capability of a chip improves by about x⁴!
  • The x⁴ comes from:
    • x² (higher transistor density)
    • x (higher clock frequency)
    • x (larger die size)
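The x⁴ figure is just the product of the three factors. Plugging in a typical per-generation shrink of x ≈ 1.4 (about √2, an illustrative value):

```python
# Raw capability gain when feature size shrinks by factor x:
x = 1.4                      # typical per-generation shrink (~sqrt(2))
density = x ** 2             # x^2 more transistors per unit area
clock = x                    # shorter wires -> clock up to x faster
die = x                      # die size historically also grew ~x
raw_gain = density * clock * die
print(raw_gain)              # x^4, i.e. ~3.84x per generation
```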

Performance gains from ==other optimizations==

  • An x³ performance gain typically comes from:
    • instruction-level parallelism (ILP)
    • locality optimizations such as caches
  • As a result, some programs gained an x³ speedup without any modification

3) Hardware-driven gains will hit a ceiling

The speed limit of a serial computer:

  • consider a 1 Tflop/s sequential machine
  • a signal can only travel about 0.3 mm per operation
  • this limit is set by the speed of light
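The 0.3 mm figure follows directly from the speed of light: at 1 Tflop/s an operation takes one picosecond, and a signal can travel at most c × 10⁻¹² s in that time.

```python
# Why a 1 Tflop/s *sequential* machine is physically constrained:
c = 3.0e8                    # speed of light, m/s (approx.)
ops_per_s = 1.0e12           # 1 Tflop/s -> one operation per picosecond
time_per_op = 1.0 / ops_per_s
distance = c * time_per_op   # farthest a signal can travel per op
print(distance * 1000, "mm") # 0.3 mm: data must live within this radius
```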

4) To push performance beyond this point, parallel computing architectures are necessary

This explains why modern processor design increasingly favors multicore and parallel architectures instead of simply chasing higher clock frequencies