
Lecture 12 Parallel Machine Learning, Part 1

Supervised Learning


Intro

Machine Learning Sources of Confusion

Method vs. Task: A common confusion is between specific learning methods and learning tasks.

  1. Principal Component Analysis (主成分分析) is a method for dimensionality reduction (降维).
  2. Support Vector Machines (支持向量机) are methods used for supervised learning tasks (监督学习).

Another confusion comes from optimization techniques vs. learning methods.

  1. Sequential Minimal Optimization (最小序列优化) is an optimization technique to train Support Vector Machines.
  2. Stochastic Gradient Descent (随机梯度下降) is a popular optimization technique to train Neural Networks.

Machine Learning relies heavily on Linear Algebra

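To make the point concrete, here is a minimal sketch (the layer sizes and batch size are illustrative assumptions, not taken from the slides): the forward pass of one fully connected layer over a batch is a single dense matrix multiplication, so most of the work in training a network is linear algebra.

```python
import numpy as np

# One fully connected layer applied to a batch of inputs is a dense
# matrix multiplication plus a bias and a nonlinearity.
# The sizes below are illustrative assumptions.
batch, d_in, d_out = 32, 1024, 256
X = np.random.randn(batch, d_in)   # a batch of flattened inputs
W = np.random.randn(d_in, d_out)   # layer weights
b = np.zeros(d_out)                # layer bias

H = np.maximum(X @ W + b, 0.0)     # matmul + bias + ReLU -> shape (32, 256)
print(H.shape)
```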

Parallelism in Machine Learning

| Aspect | Implicit parallelism (隐式并行化) | Explicit parallelism (显式并行化) |
| --- | --- | --- |
| Control | handled automatically by the system | managed manually by the programmer |
| Programming difficulty | relatively easy | relatively hard |
| Performance optimization | limited | can be optimized in depth |
| Debugging difficulty | relatively low | relatively high |
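
A minimal sketch of the contrast in Python (the matrix sizes and worker count are assumptions; run it as a script): the first product is parallelized implicitly by the multi-threaded BLAS library behind NumPy, while the second computes the same product explicitly by splitting rows across worker processes.

```python
import numpy as np
from multiprocessing import Pool

np.random.seed(0)                  # fixed seed so worker processes see the same data
A = np.random.randn(800, 800)
B = np.random.randn(800, 800)

# Implicit parallelism: NumPy hands the product to a BLAS library that is
# usually multi-threaded; the programmer writes no parallel code at all.
C_implicit = A @ B

# Explicit parallelism: the programmer decides how to split the work
# (one block of rows of A per worker) and manages the processes directly.
def row_block(i, n_workers=4):
    return np.array_split(A, n_workers)[i] @ B

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        blocks = pool.map(row_block, range(4))
    C_explicit = np.vstack(blocks)
    print(np.allclose(C_implicit, C_explicit))  # True: same result either way
```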

Supervised Learning

Training Neural Networks


Deep Neural Network Training

Highly Abstracted


Gradient Descent

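A minimal sketch of gradient descent on a least-squares toy problem (the data, learning rate, and iteration count are illustrative assumptions): every step moves the weights a small distance against the gradient of the loss computed over the full dataset.

```python
import numpy as np

# Gradient descent on the loss L(w) = 0.5 * ||X w - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # toy dataset (assumed shapes)
w_true = rng.normal(size=5)
y = X @ w_true                     # noiseless targets

w = np.zeros(5)
lr = 0.01                          # learning rate (assumed)
for step in range(500):
    grad = X.T @ (X @ w - y)       # gradient over the FULL dataset
    w -= lr * grad                 # step against the gradient

print(np.linalg.norm(w - w_true))  # ~0: the weights have converged
```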

Stochastic Gradient Descent (SGD)

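A matching sketch of SGD on the same kind of problem: each step uses a noisy gradient estimate computed from a small random mini-batch instead of the full dataset (the batch size of 32 and the learning rate are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))   # much larger toy dataset (assumed)
w_true = rng.normal(size=5)
y = X @ w_true

w = np.zeros(5)
lr, batch_size = 0.01, 32
for step in range(2_000):
    idx = rng.integers(0, len(X), size=batch_size)   # sample a random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size         # noisy gradient estimate
    w -= lr * grad

print(np.linalg.norm(w - w_true))  # ~0, without ever computing the full gradient
```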

Training Neural Networks


Parallelization Opportunities

  1. Data parallelism: Distribute all input (sets of images, text, audio, etc.)
    • Batch parallelism: Distribute each full sample to a different processor
    • Domain parallelism: Subdivide one sample and distribute parts to processors
  2. Model parallelism: Distribute the neural network (NN)
  3. Pipeline parallelism: Inter-batch parallelism (1) + pipelined through NN layers (2)

Batch and Domain

Batch Parallelism

In batch parallelism, complete samples are assigned to different processors; each processor works on full, independent data samples.

Example:

Suppose there are 100 images to process and 4 GPUs:

  • GPU1 processes images 1-25
  • GPU2 processes images 26-50
  • GPU3 processes images 51-75
  • GPU4 processes images 76-100
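
A minimal single-process sketch of this idea (the tiny linear "model" and the shapes are illustrative assumptions): the 100 samples are split into 4 chunks, each replica computes a gradient on its own chunk, and the per-replica gradients are averaged (the all-reduce step) so that every replica applies the same update.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.normal(size=(100, 64))         # "100 images", flattened to 64 features
labels = rng.normal(size=(100, 1))
W = np.zeros((64, 1))                       # tiny linear model, replicated on every GPU

chunks = np.array_split(np.arange(100), 4)  # images 0-24, 25-49, 50-74, 75-99

local_grads = []
for idx in chunks:                          # in reality these run concurrently, one per GPU
    Xb, yb = images[idx], labels[idx]
    grad = Xb.T @ (Xb @ W - yb) / len(idx)  # each replica: gradient on its own chunk
    local_grads.append(grad)

# All-reduce: average the per-replica gradients, then apply the same update
# everywhere so the model copies stay identical.
global_grad = sum(local_grads) / len(local_grads)
W -= 0.1 * global_grad
```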

Domain Parallelism

In domain parallelism, a single sample is split into multiple parts, and the parts are sent to different processors to be processed.

Example:

Suppose one 1024x1024 image is to be processed:

  • GPU1 processes the top-left part of the image (512x512)
  • GPU2 processes the top-right part of the image (512x512)
  • GPU3 processes the bottom-left part of the image (512x512)
  • GPU4 processes the bottom-right part of the image (512x512)
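
The same split expressed as a minimal NumPy sketch (the image contents are a placeholder): one 1024x1024 sample is cut into four 512x512 tiles, one per GPU. For operations such as convolutions, neighbouring tiles would additionally need to exchange boundary ("halo") rows and columns.

```python
import numpy as np

image = np.random.rand(1024, 1024)   # one single sample (placeholder contents)

# Cut the single sample into four 512x512 tiles, one per GPU.
tiles = {
    "GPU1": image[:512, :512],       # top-left
    "GPU2": image[:512, 512:],       # top-right
    "GPU3": image[512:, :512],       # bottom-left
    "GPU4": image[512:, 512:],       # bottom-right
}

for gpu, tile in tiles.items():      # each GPU runs (its part of) the network on its tile
    print(gpu, tile.shape)
```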

There are some more advanced techniques around this part; they are all omitted here, please see the slides.

Pipeline Parallelism


  • Pipeline parallelism is a mix of (inter-layer) model parallelism, as it parallelizes across the inter-layer NN model structure, and batch parallelism, as it needs micro-batches of data for filling the pipeline.

  • Pipeline bubbles: starting the forward propagation of a minibatch requires the backprop of the previous minibatch to complete (see the small calculation after this list)

  • Contribution of pipeline parallelism to the total available parallelism is a multiplicative factor that is bounded by the NN depth. Without a deep network, all micro-batching would achieve is batch parallelism.
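
A small back-of-the-envelope illustration of the last two bullets, for an idealized synchronous (GPipe-style) schedule; the stage and micro-batch counts below are assumptions. With p pipeline stages and m micro-batches per mini-batch, the fraction of time lost to bubbles is roughly (p - 1) / (m + p - 1): pushing more micro-batches shrinks the bubble overhead, while the usable pipeline parallelism itself stays bounded by the number of stages, i.e. by the NN depth.

```python
# Bubble overhead of an idealized synchronous (GPipe-style) pipeline schedule:
# p stages, m micro-batches  ->  bubble fraction ~= (p - 1) / (m + p - 1).
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

for p, m in [(4, 1), (4, 8), (4, 32), (8, 32)]:   # assumed example configurations
    print(f"stages={p:2d}  micro-batches={m:3d}  bubble={bubble_fraction(p, m):.2f}")
```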


Pipeline Parallelism

Conceptual characteristics

GPipe

  • Uses periodic pipeline flushes to synchronize weight updates
  • All workers use the same, synchronized version of the weights
  • The pipeline has to be stopped regularly for weight synchronization, which reduces efficiency

PipeDream

  • Each worker may maintain its own version of the weights
  • Uses an asynchronous weight-update strategy, so no pipeline flush is needed
  • Uses the same version of the weights for a minibatch's forward and backward pass to keep the gradient computation correct (sketched below)
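
A minimal, single-process sketch of the weight-stashing idea (a scalar toy "model"; all numbers, names, and the schedule are illustrative assumptions, not PipeDream's actual implementation): each minibatch stashes the weight version its forward pass used, so its later backward pass sees the same version even though the weights may have been updated asynchronously in between.

```python
stash = {}          # minibatch id -> (weight version used in forward, cached input)
w = 0.5             # current weights of this pipeline stage
lr = 0.1

def forward(mb_id, x):
    stash[mb_id] = (w, x)              # remember the version this minibatch saw
    return w * x

def backward(mb_id, y):
    global w
    w_used, x = stash.pop(mb_id)       # the SAME version as in the forward pass
    grad = (w_used * x - y) * x        # d/dw of 0.5 * (w*x - y)^2 at the stashed w
    w = w - lr * grad                  # asynchronous update of the latest weights

# Interleaved schedule, no pipeline flush between minibatches 1 and 2:
forward(1, x=1.0)
forward(2, x=2.0)                      # forward of 2 still uses the pre-update weights
backward(1, y=2.0)                     # updates w ...
backward(2, y=4.0)                     # ... but 2's gradient uses its stashed weights
print(w)
```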

Memory usage

GPipe

  • Splits each mini-batch into multiple micro-batches to reduce memory usage
  • Saves memory by recomputing intermediate activations, at the cost of extra computation (sketched below)
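
A minimal sketch of this recompute-instead-of-store trick on a two-layer toy network (the shapes, the loss, and the backward formulas are illustrative assumptions): the forward pass drops the intermediate activation, and the backward pass recomputes it from the stored layer input, trading extra compute for lower memory.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 8))             # toy two-layer network (assumed sizes)
W2 = rng.normal(size=(1, 8))
x = rng.normal(size=8)

# Forward pass: do NOT keep the intermediate activation h1, only the input x.
h1 = np.maximum(W1 @ x, 0.0)
y = W2 @ h1
del h1                                   # freed to save memory

# Backward pass (for loss = y[0]): recompute h1 from the stored input first.
h1 = np.maximum(W1 @ x, 0.0)             # extra compute instead of extra memory
grad_y = np.ones(1)
grad_W2 = np.outer(grad_y, h1)           # gradient w.r.t. W2
grad_h1 = W2.T @ grad_y                  # gradient flowing back into h1
grad_h1 = np.where(W1 @ x > 0, grad_h1, 0.0)   # ReLU derivative
grad_W1 = np.outer(grad_h1, x)           # gradient w.r.t. W1
print(grad_W1.shape, grad_W2.shape)
```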

PipeDream

  • Limits the number of mini-batches in flight in the pipeline to reduce memory overhead
  • Keeps the necessary activations instead of recomputing them, balancing memory against computation

Performance optimization focus

GPipe

  • Mainly optimizes the memory efficiency of large-model training
  • Uses micro-batching to improve hardware utilization

PipeDream

  • Puts more emphasis on optimizing compute efficiency and communication overhead
  • Balances the computational load via automated model partitioning

Citations: from perplexity

Unsupervised and Semi-Supervised Learning

Support Vector Machine (SVM)

The linear case


The non-linear case

Kernel Support Vector Machine (the non-linear case)


  1. The kernel itself is a very tricky function; it can work magic.
  2. Many problems are hard to handle directly, but once a kernel function maps them into a higher-dimensional space, they can easily be handled by a linear SVM.
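
A minimal scikit-learn sketch of point 2 (the dataset and parameters are illustrative assumptions): two concentric circles cannot be separated by a linear SVM, but an RBF-kernel SVM, which implicitly works in a higher-dimensional feature space, separates them almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)    # kernel trick: implicit high-dimensional features

print("linear SVM accuracy:", linear_svm.score(X, y))    # roughly chance level (~0.5)
print("RBF-kernel SVM accuracy:", rbf_svm.score(X, y))   # close to 1.0
```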


There are many magical kernel functions and corresponding SVM variants; please refer to the slides.