Chapter 13 Query Execution Part 2¶

We discussed in the last class how to compose operators together into a plan to execute an arbitrary query.

We assumed that the queries execute with a single worker (e.g., a thread).

We will now discuss how to execute queries using multiple workers.

WHY CARE ABOUT PARALLEL EXECUTION?

Increased performance for potentially the same hardware resources.
- Higher Throughput
- Lower Latency
Increased responsiveness of the system.
Potentially lower total cost of ownership (TCO)
- Fewer machines means less parts / physical footprint / energy consumption.

PARALLEL VS. DISTRIBUTED

Database is spread out across multiple resources to improve different aspects of the DBMS.

Appears as a single logical database instance to the application, regardless of physical organization.

SQL query for a single-resource DBMS should generate same result on a parallel or distributed DBMS.

Parallel DBMSs

Resources are physically close to each other.
Resources communicate over high-speed interconnect.
Communication is assumed to be cheap and reliable.

Distributed DBMSs

Resources can be far from each other. (large physical distance)
Resources communicate using slow(er) interconnect.
Communication cost and problems cannot be ignored.

Process Models¶

A DBMS’s process model defines how the system is architected to support concurrent requests from a multi-user application.

A worker is the DBMS component that is responsible for executing tasks on behalf of the client and returning the results.

Approach #1: Process per DBMS Worker
Approach #2: Thread per DBMS Worker
Approach #3: Embedded DBMS

PROCESS PER WORKER¶

Each worker is a separate OS process.

Process: 进程

Relies on OS scheduler.
Use shared-memory for global data structures.
A process crash does not take down entire system. Examples: IBM DB2, Postgres, Oracle.

alt text

THREAD PER WORKER¶

Thread: 线程

Single process with multiple worker threads.

DBMS (mostly) manages its own scheduling.
May or may not use a dispatcher thread.
Thread crash (may) kill the entire system.
Examples: MSSQL, MySQL, DB2, Oracle (2014).

alt text

What's the difference between PW and TW?

在“每个工作者一个进程”（Process Per Worker）模型中，每个工作者（Worker）都是一个独立的操作系统进程。每个进程都有自己独立的内存空间，操作系统负责进程的调度和管理。

依赖操作系统调度：
- 每个工作者都是一个独立的进程，操作系统的调度程序决定每个进程何时运行。
使用共享内存：
- 共享内存用于全局数据结构的访问。进程间通信（IPC）主要通过共享内存来实现。
进程崩溃的影响：
- 由于每个工作者是独立的进程，一个进程的崩溃不会导致整个系统崩溃。这提供了一定的隔离和容错能力。

示例

假设一个数据库系统中有多个客户端请求并发访问和操作数据，每个请求由一个独立的进程来处理。例如，Postgres 数据库系统。

Postgres：每个客户端连接都会创建一个新的后台进程来处理请求。假设有10个客户端连接到Postgres数据库，那么Postgres会启动10个独立的进程来处理这些连接。每个进程都有自己独立的内存空间，彼此之间通过共享内存进行通信和协调。如果其中一个进程崩溃，其他进程可以继续正常工作，不会影响到整体系统的稳定性。

在“每个工作者一个线程”（Thread Per Worker）模型中，所有工作者在同一个操作系统进程内作为线程运行。线程由数据库管理系统（DBMS）自己调度和管理，而不是依赖操作系统。

DBMS 自行调度：
- 数据库系统大多数情况下自行调度线程的执行，而不是依赖操作系统的调度程序。这使得数据库系统可以更好地优化调度策略，以提高性能。
是否使用调度线程：
- 系统可能会使用一个或多个调度线程来分配任务给工作线程，或者直接让工作线程自己处理任务。
线程崩溃的影响：
- 由于所有线程在同一个进程中运行，如果一个线程崩溃，可能导致整个进程崩溃，从而影响整个系统的稳定性。

示例

假设一个数据库系统中有多个客户端请求并发访问和操作数据，每个请求由一个线程来处理。例如，MySQL 数据库系统。

MySQL：MySQL的多线程模型在单个进程内创建多个线程，每个客户端连接由一个独立的线程处理。如果有10个客户端连接到MySQL数据库，MySQL会在一个进程内启动10个线程来处理这些连接。线程共享进程的内存空间，这使得线程间通信比进程间通信更加高效。然而，如果其中一个线程出现严重错误，可能导致整个进程崩溃，从而影响所有连接。

总结

特性	Process Per Worker	Thread Per Worker
调度	由操作系统调度	由数据库系统自行调度
内存隔离	每个进程有独立的内存空间	所有线程共享同一个进程的内存空间
进程/线程崩溃的影响	一个进程崩溃不会影响其他进程	一个线程崩溃可能导致整个进程崩溃
进程间/线程间通信	通过共享内存或其他进程间通信机制实现	通过进程内共享的内存实现
示例	Postgres、IBM DB2、Oracle	MySQL、MSSQL、Oracle（2014年后）、DB2

这两种模型各有优缺点。选择哪种模型取决于系统的需求和设计目标。Process Per Worker 提供更好的隔离和容错性，而 Thread Per Worker 则提供更高的性能和效率。

SCHEDULING¶

For each query plan, the DBMS decides where, when, and how to execute it.

How many tasks should it use?
How many CPU cores should it use?
What CPU core should the tasks execute on?
Where should a task store its output?

The DBMS always knows more than the OS. (in Database System)

SQL SERVER – SQLOS¶

SQLOS is a user-level OS layer that runs inside of the DBMS and manages provisioned (已配置的) hardware resources.

Determines which tasks are scheduled onto which threads.
Also manages I/O scheduling and higher-level concepts like logical database locks.

Non-preemptive thread scheduling through instrumented DBMS code. 通过检测 DBMS 代码进行非抢占式线程调度

How to design SQLOS

MSSQL SQLOS

alt text

HOW TO CONTROL SQLOS¶

这里展示了SQL OS（SQL Operating System）中的调度机制以及 DBMS（数据库管理系统）开发者需要如何在代码中插入显式的 yield 调用来进行控制

SQL OS 时间量 (Quantum)

Quantum 是 4 毫秒：
- SQL OS 设定了一个时间片（quantum），即每个任务最多可以连续运行 4 毫秒。
调度器无法强制执行时间片：
- 尽管设定了 4 毫秒的时间片，但调度器不能自动中断正在运行的任务来严格强制执行这个时间片。

DBMS 开发者的责任

显式 yield 调用：
- 由于调度器无法强制中断任务，DBMS 开发者必须在代码的适当位置添加显式的 yield 调用，以确保任务在运行一段时间后让出 CPU。

Python
last = now()
for tuple in R:
    if now() - last > 4:
        yield
        last = now()
    if eval(predicate, tuple, params):
        emit(tuple)

EMBEDDED DBMS¶

DBMS runs inside of the same address space as the application. Application is (mostly) responsible for threads and scheduling.

The application may support outside connections.

Examples: BerkeleyDB, SQLite, RocksDB, LevelDB

alt text

Execution Parallelism¶

INTER- VS. INTRA- QUERY PARALLELISM¶

Note

Advantages of a multi-threaded architecture: Thread!

Less overhead per context switch.
Do not have to manage shared memory.

The thread per worker model does not mean that the DBMS supports intra-query parallelism.

根据上面的描述，DBMS 并不一定支持在同一个查询内并行处理。每个线程可能负责处理一个独立的查询，而不是多个线程并行处理一个查询的不同部分！

Andy is not aware of any new DBMS from last 15 years that doesn't use native OS threads unless they are Redis or Postgres forks.

Inter-Query: Execute multiple disparate (不同的) queries simultaneously.

Increases throughput & reduces latency.

Intra-Query: Execute the operations of a single query in parallel.

Decreases latency for long-running queries, especially for OLAP queries.

INTER-QUERY PARALLELISM¶

Improve overall performance by allowing multiple queries to execute simultaneously.

If queries are read-only, then this requires almost no explicit coordination between queries.
- Buffer pool can handle most of the sharing if necessary
If multiple queries are updating the database at the same time, then this is hard to do correctly…

只读查询

几乎不需要显式协调：
- 当查询是只读的（即不修改数据库中的数据），多个查询可以同时进行而不需要复杂的同步和协调。因为这些查询只读取数据，而不修改数据，所以不会发生冲突或不一致的情况。
缓冲池处理共享 (进一步提升)：
- Buffer Pool 是数据库系统中的一个组件，用于管理数据的缓存。当多个查询需要访问相同的数据时，缓冲池可以缓存这些数据，减少磁盘 I/O 操作，提高查询性能。如果需要，缓冲池可以有效地处理这些只读查询之间的数据共享。

更新查询

多查询同时更新的复杂性：
- 当多个查询同时更新数据库时，需要进行复杂的协调和同步。这是因为多个查询可能会同时修改相同的数据，导致数据的不一致性和冲突。
- 例如，一个查询可能正在更新某一行的数据，而另一个查询同时也试图更新或读取同一行的数据。在这种情况下，必须使用锁、事务和其他同步机制来确保数据库的一致性和正确性。

INTRA-QUERY PARALLELISM¶

Improve the performance of a single query by executing its operators in parallel.

Think of organization of operators in terms of a producer/consumer paradigm (范式).

There are parallel versions of every operator.

Can either have multiple threads access centralized data structures or use partitioning to divide work up.

Parallel Grace Hash Join

alt text

Use a separate worker to perform the join for each level of buckets for R and S after partitioning.

INTRA - QUERY PARALLELISM

Intra-Operator (Horizontal)
Inter-Operator (Vertical)
Bushy

INTRA - OPERATOR PARALLELISM¶

Approach #1: Intra-Operator (Horizontal)

Decompose operators into independent fragments that perform the same function on different subsets of data. 将运算符分解为对不同数据子集执行相同功能的独立片段

The DBMS inserts an exchange operator into the query plan to coalesce/split results from multiple children/parent operators. (Postgres calls this "gather")

coalesce = merge

alt text

INTER - OPERATOR PARALLELISM¶

Approach #2: Inter-Operator (Vertical)

Operations are overlapped in order to pipeline data from one stage to the next without materialization.
Workers execute operators from different segments of a query plan at the same time.
More common in streaming systems (continuous queries)

Also called pipeline parallelism.

alt text

流水线并行（Pipelining）

假设有一个简单的查询，步骤如下：

从表中读取数据。
对数据进行过滤。
将过滤后的数据进行聚合计算。

在传统的物化模型中，每个步骤都会等待前一个步骤完全完成并将中间结果存储后才会开始：

读取阶段：读取所有数据并存储。
过滤阶段：读取存储的数据，过滤并存储结果。
聚合阶段：读取过滤后的数据并进行聚合计算。

在流水线并行模型中，每个步骤可以同时进行：

读取阶段：工作者读取数据并立即将其传递给过滤阶段。
过滤阶段：工作者接收数据并立即进行过滤，然后将过滤结果传递给聚合阶段。
聚合阶段：工作者接收过滤数据并立即进行聚合计算。

这样，每个阶段的数据处理是重叠的，减少了中间存储和等待时间。

流处理系统中的应用

在流处理系统中，数据不断流入，系统需要持续处理：

数据从传感器或其他输入源流入。
读取数据的工作者不断将数据传递给过滤阶段的工作者。
过滤阶段的工作者实时处理数据并传递给下一个阶段的工作者。
聚合阶段的工作者实时计算结果并输出。

这种方式可以确保系统能够实时处理不断流入的数据，而不必等待整个批处理完成。

BUSHY PARALLELISM¶

Approach #3: Bushy Parallelism

Hybrid of intra- and inter-op parallelism where workers execute multiple operators from different segments of a query plan at the same time.
Still need exchange operators to combine intermediate results from segments.

alt text

I/O PARALLELISM¶

Split the DBMS across multiple storage devices to improve disk bandwidth latency.

Many different options that have trade-offs:

Multiple Disks per Database
One Database per Disk
One Relation per Disk
Split Relation across Multiple Disks

Some DBMSs support this natively. Others require admin to configure outside of DBMS.

MULTI - DISK PARALLELISM¶

Configure OS/hardware to store the DBMS's files across multiple storage devices.

Storage Appliances
RAID Configuration

This is transparent to the DBMS.

alt text

DATABASE PARTITIONING¶

partition 划分；分区

Some DBMSs allow you to specify the disk location of each individual database.

The buffer pool manager maps a page to a disk location.

This is also easy to do at the filesystem level if the DBMS stores each database in a separate directory.

The DBMS recovery log file might still be shared if transactions can update multiple databases.

这段文字在讨论数据库分区（Database Partitioning）的概念，以及如何在磁盘上指定每个数据库的位置，并且在文件系统级别进行存储管理。

数据库分区的关键点

指定数据库的磁盘位置：
- 有些数据库管理系统（DBMS）允许你为每个数据库指定它们在磁盘上的具体位置。这可以帮助优化存储和访问性能，因为你可以根据数据库的使用模式将其放置在不同的磁盘上。
缓冲池管理器映射页面到磁盘位置：
- 缓冲池管理器负责将数据库的页面映射到磁盘位置。这意味着缓冲池管理器知道每个页面在磁盘上的具体位置，并可以有效地进行读取和写入操作。
在文件系统级别实现：
- 如果DBMS将每个数据库存储在一个单独的目录中，那么在文件系统级别实现数据库分区是很容易的。
- 例如，每个数据库都有自己独立的目录，目录中的文件对应数据库的不同部分（例如表、索引等）。这样，文件系统就能管理每个数据库的存储位置。
共享的恢复日志文件：
- 尽管数据库可以分区并存储在不同的磁盘位置，但是如果事务可以更新多个数据库，那么DBMS的恢复日志文件可能还是共享的。
- 恢复日志文件记录了数据库操作的日志，用于在出现故障时进行恢复。如果一个事务涉及多个数据库的操作，这些操作的日志都需要记录在同一个恢复日志文件中，以确保事务的一致性和完整性。

示例解释

指定磁盘位置

假设有两个数据库：数据库A和数据库B。

数据库A：存储在磁盘1上的路径为/mnt/disk1/databaseA。
数据库B：存储在磁盘2上的路径为/mnt/disk2/databaseB。

通过这种方式，数据库A和数据库B的数据被分开存储在不同的磁盘上，可以优化存储性能和访问速度。

文件系统级别的实现

在文件系统级别，每个数据库都有一个单独的目录：

/mnt/disk1/databaseA 包含数据库A的所有文件。
/mnt/disk2/databaseB 包含数据库B的所有文件。

这种目录结构使得管理和备份每个数据库变得更加简单和高效。

共享的恢复日志文件

假设有一个事务同时更新了数据库A和数据库B中的数据。为了确保这个事务的原子性和一致性，DBMS会将这个事务的所有操作记录在同一个恢复日志文件中。

恢复日志文件路径：/mnt/disk1/recovery.log

这个日志文件包含了数据库A和数据库B的操作日志。在系统故障时，DBMS可以使用这个日志文件来恢复所有数据库到一致的状态。

小结
数据库分区：允许指定每个数据库在磁盘上的位置，可以优化存储和访问性能。
缓冲池管理器：负责将页面映射到磁盘位置，确保高效的读写操作。
文件系统级别实现：将每个数据库存储在单独的目录中，简化管理和备份。
共享恢复日志文件：当事务涉及多个数据库时，共享的恢复日志文件确保事务的一致性和完整性。

通过这种方式，DBMS可以实现高效的存储管理和数据恢复，同时确保系统的一致性和性能。

PARTITIONING¶

Split single logical table into disjoint physical segments that are stored/managed separately.

Partitioning should (ideally) be transparent to the application.

The application should only access logical tables and not have to worry about how things are physically stored.

We will cover this further when we talk about distributed databases after the mid-term.

这段文字在讨论数据库分区（Partitioning）的概念，特别是如何将一个逻辑表拆分成多个物理段，并将这些段独立存储和管理。这种分区应该对应用程序透明，也就是说，应用程序不需要知道数据是如何物理存储的。以下是对这段文字的详细解释：

数据库分区的关键点

逻辑表拆分为不相交的物理段：
- 一个单一的逻辑表可以被拆分成多个不相交的物理段（segments），每个段独立存储和管理。
- 这种分区方法有助于提高数据管理的灵活性和性能。例如，可以根据数据的某个属性（如日期或地理位置）将表划分为不同的段。
对应用程序透明：
- 分区对应用程序来说应该是透明的。也就是说，应用程序只需与逻辑表进行交互，而不需要关心数据在底层是如何物理存储和管理的。
- 这使得应用程序的设计和开发更为简单，因为它们不需要处理复杂的存储细节。

示例解释

假设有一个大型的客户订单表 orders，其中包含以下字段：

order_id
customer_id
order_date
total_amount

为了优化查询性能和存储管理，可以将这个表根据订单日期 order_date 进行分区。

分区示例

物理分区：
- orders_2021：存储2021年的订单数据
- orders_2022：存储2022年的订单数据
- orders_2023：存储2023年的订单数据

每个物理分区是一个独立的表，但它们共同表示一个逻辑表 orders。

透明性：

应用程序执行查询时，仍然像操作单一表 orders 一样：

SQL
SELECT * FROM orders WHERE order_date BETWEEN '2022-01-01' AND '2022-12-31';

数据库系统负责将查询分解到合适的物理分区中（在这个例子中，是 orders_2022 分区），应用程序无需知道这些细节。

分布式数据库中的应用

当讨论分布式数据库时，分区将更为重要和复杂。分布式数据库中，数据可能分布在多个物理节点上：

每个物理段可能存储在不同的节点上，分布在不同的地理位置。
数据库系统需要处理跨节点的查询和数据一致性。

分区在分布式数据库中可以显著提高数据的可用性和访问速度，同时减少单个节点的负载。

小结

逻辑表拆分为物理段：将一个逻辑表分成多个独立存储和管理的物理段。
透明性：应用程序只与逻辑表交互，不需要关心底层物理存储细节。
分布式数据库：分区在分布式数据库中更加重要，用于提高性能和可用性。

通过这种方式，数据库系统能够更好地管理和优化大规模数据存储，同时简化应用程序的开发和维护。