
Chapter 10 Sorting & Aggregations

We are now going to talk about how to execute queries using the DBMS components we have discussed so far.



QUERY PLAN

The operators are arranged in a tree.

Data flows from the leaves of the tree up towards the root.

  • We will discuss the granularity of the data movement next week.

The output of the root node is the result of the query.


DISK - ORIENTED DBMS

Just like it cannot assume that a table fits entirely in memory, a disk-oriented DBMS cannot assume that query results fit in memory.

We will use the buffer pool to implement algorithms that need to spill to disk.

We are also going to prefer algorithms that maximize the amount of sequential I/O.

WHY DO WE NEED TO SORT?

Relational model/SQL is unsorted.

1. The query explicitly asks for sorted output

Queries may request that tuples are sorted in a specific way (ORDER BY).

2. Sorting makes other operations easier

But even if a query does not specify an order, we may still want to sort to do other things:

  • Trivial to support duplicate elimination (DISTINCT)
  • Bulk loading sorted tuples into a B+Tree index is faster
  • Aggregations (GROUP BY)

IN-MEMORY SORTING

If data fits in memory, then we can use a standard sorting algorithm like quicksort.

If data does not fit in memory, then we need to use a technique that is aware of the cost of reading and writing disk pages…

Note

"If the data does not fit entirely in memory"

This means the data set is too large to load into memory all at once. In that case, we need a technique that is aware of the cost of reading and writing disk pages :)

TOP-N HEAP SORT

If a query contains an ORDER BY with a LIMIT, then the DBMS only needs to scan the data once to find the top-N elements.

Ideal scenario for heap sort if the top-N elements fit in memory.

  • Scan the data once, maintaining an in-memory priority queue of the current top-N elements.

This is a good place to review how priority queues work.

Usage Condition
  1. The query only needs the top-N elements (ORDER BY with LIMIT)
  2. The top-N elements fit in memory
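A minimal Python sketch of this idea, using the standard library's heapq module (function and variable names here are mine, not part of any DBMS):

```python
import heapq

def top_n(stream, n):
    """Find the n largest values in a single scan over the input.

    A min-heap of size n holds the current top-n; its root (heap[0])
    is the smallest of them, so any incoming value larger than the
    root belongs in the heap.
    """
    heap = []  # min-heap of at most n elements
    for value in stream:
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # value beats the smallest of the current top-n: swap it in
            heapq.heapreplace(heap, value)
    # Final ORDER BY ... DESC LIMIT n output
    return sorted(heap, reverse=True)
```

Because the heap never holds more than n elements, this works no matter how large the scanned table is, as long as the top-N fit in memory.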

EXTERNAL MERGE SORT

Divide-and-Conquer algorithm that splits data into separate runs, sorts them individually, and then combines them into longer sorted runs.

Phase #1 – Sorting

  • Sort chunks of data that fit in memory and then write back the sorted chunks to a file on disk.

Phase #2 – Merging

  • Combine sorted runs into larger chunks.

SORTED RUN

A run is a list of key/value pairs.

Key: The attribute(s) to compare to compute the sort order.

Value: Two choices

  • Tuple (early materialization).
  • Record ID (late materialization).
Note

In a database system, a Record ID is the unique identifier of a record, used to quickly locate it in storage. Here, Record IDs are the basis of a late-materialization strategy.

The terms involved:

  1. Run: an intermediate result of the sort, structured as a list of key/value pairs.
  2. Key: the attribute(s) the sort order is computed on.
  3. Value: two choices for what each key carries:
    • Tuple (early materialization): carry the full tuple (the record's actual data) through the sort.
    • Record ID (late materialization): carry only the record's identifier rather than its data; the actual tuple is fetched only when it is finally needed.

Carrying only Record IDs is called late materialization. Its advantage is lower memory use: the sort shuffles small IDs instead of full records, which is especially helpful when memory is limited (at the cost of fetching the records afterwards).
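The two choices can be contrasted in a few lines of Python. The table, its columns, and the use of list positions as record IDs are all assumptions for the sketch:

```python
# Hypothetical table of (id, name, salary) tuples; a row's position in
# the list stands in for its Record ID.
table = [(3, "carol", 90), (1, "alice", 70), (2, "bob", 80)]

# Early materialization: the full tuples travel through the sort.
early = sorted(table, key=lambda t: t[2])  # sort key: salary

# Late materialization: sort only (key, record_id) pairs, then fetch the
# tuples afterwards. The runs are smaller, but each fetch is a lookup
# (a random I/O in a real disk-based system).
pairs = sorted((t[2], rid) for rid, t in enumerate(table))
late = [table[rid] for _, rid in pairs]
```

Both produce the same ordering; they differ in how much data the sort itself has to move.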

2-WAY EXTERNAL MERGE SORT

We will start with a simple example of a 2-way external merge sort.

  • “2” is the number of runs that we are going to merge into a new run for each pass.

Data is broken up into N pages.

The DBMS has a finite number of B buffer pool pages to hold input and output data.

Pass #0

  • Read all B pages of the table into memory
  • Sort pages into runs and write them back to disk

Pass #1,2,3,…

  • Recursively merge pairs of runs into runs twice as long
  • Uses three buffer pages (2 for input pages, 1 for output)
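The passes above can be simulated in memory, with Python lists standing in for disk pages and runs (a sketch only; a real DBMS would stream pages through the three buffers rather than hold whole runs in memory):

```python
import heapq

def two_way_external_sort(pages):
    """2-way external merge sort over a list of 'pages' (lists of keys).

    Pass #0 sorts each page into a one-page run; every later pass
    merges pairs of runs into runs twice as long.
    """
    # Pass #0: each sorted page becomes a run.
    runs = [sorted(p) for p in pages]
    # Passes #1, 2, 3, ...: merge adjacent pairs of runs.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]              # two runs (or one leftover)
            merged.append(list(heapq.merge(*pair)))  # streaming 2-way merge
        runs = merged
    return runs[0] if runs else []
```

With 5 pages there are 5 one-page runs after pass #0, then 3, then 2, then 1, matching the ⌈log₂⌉ merge passes of the diagram.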


DOUBLE BUFFERING OPTIMIZATION

Prefetch the next run in the background and store it in a second buffer while the system is processing the current run.

Reduces the wait time for I/O requests at each step by continuously utilizing the disk.

Note

Double buffering uses two buffers to keep data processing and disk I/O overlapped:

  • Prefetch the next run in the background: while the system processes the current run, it reads the next run's data ahead of time.
  • Store it in a second buffer: the prefetched data goes into a separate buffer instead of mixing with the current run's data.
  • Less waiting: because the next run is already sitting in the second buffer when the system needs it, there is no stall on a disk read, reducing wait time at every step.
  • The disk stays busy: I/O requests are issued continuously rather than pausing between runs, improving overall I/O throughput.
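A toy version of the overlap, sketched with a one-worker thread pool (the run contents and `read_run` are made up to stand in for disk reads):

```python
from concurrent.futures import ThreadPoolExecutor

def read_run(run_id):
    """Stand-in for reading one sorted run from disk."""
    return [run_id * 10 + i for i in range(3)]

def process_all(run_ids):
    """Process each run while the next one is prefetched in the background."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_run, run_ids[0])    # fill the first buffer
        for next_id in run_ids[1:]:
            current = future.result()                 # usually already done
            future = pool.submit(read_run, next_id)   # prefetch into buffer #2
            results.append(sum(current))              # "process" current run
        results.append(sum(future.result()))          # last run
    return results
```

The `pool.submit` before the processing step is the whole trick: the read of run k+1 and the processing of run k happen at the same time.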

GENERAL EXTERNAL MERGE SORT

Pass #0

  • Use B buffer pages
  • Produce ⌈N/B⌉ sorted runs of size B

Pass #1,2,3,…

  • Merge B-1 runs (i.e., K-way merge)

Summary

  1. Number of passes = \(1 + \lceil \log_{B-1} \lceil N / B \rceil \rceil\)
  2. Total I/O cost = \(2N \times (\text{number of passes})\)
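These two formulas can be checked with a few lines of Python. The loop-based pass count avoids floating-point `log` entirely, using only integer ceiling division:

```python
def ceil_div(a, b):
    # Integer ceiling division: ceil(a / b) without floats.
    return -(-a // b)

def num_passes(n_pages, b_buffers):
    """1 + ceil(log_{B-1}(ceil(N/B))), computed by simulating the passes."""
    runs = ceil_div(n_pages, b_buffers)   # sorted runs after pass #0
    passes = 1                            # pass #0 itself
    while runs > 1:                       # each pass merges B-1 runs at once
        runs = ceil_div(runs, b_buffers - 1)
        passes += 1
    return passes

def total_io(n_pages, b_buffers):
    # Every pass reads and writes each of the N pages once.
    return 2 * n_pages * num_passes(n_pages, b_buffers)
```

For example, with N = 108 pages and B = 5 buffers: ⌈108/5⌉ = 22 runs, then 1 + ⌈log₄ 22⌉ = 4 passes, for 2 × 108 × 4 = 864 total I/Os.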


COMPARISON OPTIMIZATIONS

Approach #1: Code Specialization

  • Instead of providing a comparison function as a pointer to the sorting algorithm, create a hardcoded version of sort that is specific to a key type.

Approach #2: Suffix Truncation

  • First compare a binary prefix of long VARCHAR keys instead of performing a slower full string comparison. Fall back to the slower comparison only if the prefixes are equal.

Note

Approach #1: Code specialization

"Instead of passing a comparison function as a pointer into the sorting algorithm, create a hardcoded version of the sort that is specific to a key type."

A generic sort usually takes a comparison function, for example a function pointer passed to C's qsort. That is flexible, but every comparison goes through an indirect call, which adds overhead. Code specialization instead generates a dedicated sort for each key type (integers, floats, strings, ...) with the comparison hardcoded. This sacrifices flexibility but can be significantly faster because it eliminates the function-pointer calls.

Approach #2: Suffix truncation

"First compare a binary prefix of long VARCHAR keys instead of the slower string comparison; fall back to the slow path if the prefixes are equal."

Fully comparing long VARCHAR keys is expensive. Comparing a fixed-size binary prefix first settles most unequal pairs quickly; only when the prefixes tie does the full string comparison run. This greatly reduces the number of expensive comparisons.
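The suffix-truncation idea can be sketched as a comparator (the 4-byte prefix length is an arbitrary choice for illustration):

```python
def prefix_compare(a: bytes, b: bytes, prefix_len: int = 4) -> int:
    """Return <0, 0, >0 like a C-style comparator.

    The cheap fixed-width prefix comparison decides most cases;
    the full comparison runs only on a prefix tie.
    """
    pa, pb = a[:prefix_len], b[:prefix_len]
    if pa != pb:                        # fast path: prefixes differ
        return -1 if pa < pb else 1
    # Slow path: prefixes equal, compare the whole keys.
    if a == b:
        return 0
    return -1 if a < b else 1
```

In a real system the prefix would be a fixed-width integer packed alongside the key so the fast path is a single machine-word comparison.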

USING B+TREES FOR SORTING

If the table that must be sorted already has a B+Tree index on the sort attribute(s), then we can use that to accelerate sorting.

Retrieve tuples in desired sort order by simply traversing the leaf pages of the tree.

Cases to consider:

  • Clustered B+ Tree
  • Unclustered B+ Tree

CLUSTERED B+ TREE

This is the good case: the leaf nodes store the data itself!

Traverse to the left-most leaf page, and then retrieve tuples from all leaf pages.

This is always better than external sorting because there is no computational cost, and all disk access is sequential.


UNCLUSTERED B+TREE

The bad case: the leaves store only pointers, so we have to chase each one to its tuple :(

Chase each pointer to the page that contains the data.

This is almost always a bad idea.

In general, one I/O per data record.


AGGREGATIONS

Collapse values for a single attribute from multiple tuples into a single scalar value.

The DBMS needs a way to quickly find tuples with the same distinguishing attributes for grouping.

Two implementation choices:

  • Sorting
  • Hashing

The DBMS collapses the values of a single attribute across many tuples into one scalar value, and it needs a fast way to find the tuples that share the same grouping attributes. The two main implementation choices work as follows:

  1. Sorting: sort the data on the grouping attributes, then scan the sorted output. Tuples with equal values are now adjacent, so each group is a consecutive run of the sorted data.

  2. Hashing: build a hash table keyed on the grouping attributes. Tuples with the same key land in the same bucket, so finding a tuple's group is a single probe.

SORTING AGGREGATION

The input tuples arrive in arbitrary order; sorting on the grouping key places all tuples of the same group next to each other.
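The sort-then-scan pattern can be sketched with Python's itertools.groupby, which only groups adjacent equal keys, exactly the property the sort provides (rows and column index are hypothetical):

```python
from itertools import groupby

def sort_aggregate_count(rows, key_col):
    """GROUP BY rows[key_col] with COUNT(*): sort, then one scan.

    After sorting, equal keys are adjacent, so each group is a
    contiguous range of the sorted output.
    """
    rows = sorted(rows, key=lambda r: r[key_col])
    return {k: sum(1 for _ in grp)
            for k, grp in groupby(rows, key=lambda r: r[key_col])}
```

Note that groupby without the preceding sort would split a group whose tuples are interleaved with others, which is why the sort is required.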


ALTERNATIVES TO SORTING

What if we do not need the data to be ordered?

  • Forming groups in GROUP BY (no ordering)
  • Removing duplicates in DISTINCT (no ordering)

Hashing is a better alternative in this scenario.

  • Only need to remove duplicates, no need for ordering.
  • Can be computationally cheaper than sorting.

We only need to place each tuple into the right group; we do not need to maintain any order within or across the groups.

HASHING AGGREGATE

Populate an ephemeral (temporary) hash table as the DBMS scans the table.

For each record, check whether there is already an entry in the hash table:

If there is a match, the meaning depends on the operation:

  • DISTINCT: Discard the duplicate
  • GROUP BY: Perform the aggregate computation

If everything fits in memory, then this is easy :-)

If the DBMS must spill data to disk, then we need to be smarter :(

Note

As the DBMS scans the table, it populates an ephemeral hash table, one that exists only for the duration of the operation. For each record, it checks whether the hash table already contains a matching entry:

  • DISTINCT: discard duplicates
    • If the goal is distinct values, a record whose entry already exists in the hash table is simply discarded, so each value appears only once in the result.
  • GROUP BY: perform the aggregate computation
    • If the goal is grouping, a hash-table hit does not discard the record; instead, the existing entry is updated with the aggregate (SUM, AVG, COUNT, ...). The hash table makes it fast to find the group each record belongs to.
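The in-memory (everything-fits) case is a few lines of Python with a dict as the hash table (function names are mine):

```python
def hash_distinct(rows):
    """DISTINCT: keep the first occurrence of each value, discard duplicates."""
    seen = {}
    for row in rows:
        seen.setdefault(row, True)   # hash-table hit => duplicate, ignored
    return list(seen)                # dicts preserve insertion order

def hash_group_count(rows, key_col):
    """GROUP BY rows[key_col] with COUNT(*): aggregate on each hit."""
    table = {}
    for row in rows:
        k = row[key_col]
        table[k] = table.get(k, 0) + 1   # hit => update, miss => insert 1
    return table
```

The spill-to-disk case below is what makes this harder than the dict version.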

PHASE #1 – PARTITION

Use a hash function \(h_1\) to split tuples into partitions on disk.

  • A partition is one or more pages that contain the set of keys with the same hash value.
  • Partitions are “spilled” to disk via output buffers.

Assume that we have \(B\) buffers.

We will use \(B-1\) buffers for the partitions and 1 buffer for the input data.


PHASE #2 – REHASH

For each partition on disk:

  • Read it into memory and build an in-memory hash table based on a second hash function \(h_2\).
  • Then go through each bucket of this hash table to bring together matching tuples.

This assumes that each partition fits in memory.
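Both phases together, sketched for DISTINCT with in-memory lists standing in for the spilled partitions (the salted \(h_1\) and the partition count are arbitrary choices for the sketch; Python's built-in hash plays the role of \(h_2\)):

```python
def external_hash_distinct(keys, num_partitions=4):
    """Two-phase hashing: h1 partitions the input, h2 dedupes per partition."""
    h1 = lambda k: hash(("h1", k)) % num_partitions

    # Phase #1 - Partition: equal keys always land in the same partition,
    # so each partition can later be deduplicated independently.
    partitions = [[] for _ in range(num_partitions)]
    for k in keys:
        partitions[h1(k)].append(k)

    # Phase #2 - ReHash: build an in-memory hash table per partition
    # (assumes each partition fits in memory).
    result = []
    for part in partitions:
        seen = set()             # in-memory table built with h2 (here: hash())
        for k in part:
            if k not in seen:
                seen.add(k)
                result.append(k)
    return result
```

The key invariant is that duplicates can never straddle two partitions, which is why phase #2 never needs to look at more than one partition at a time.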


HASHING SUMMARIZATION

During the ReHash phase, store pairs of the form (GroupKey, RunningVal)

When we want to insert a new tuple into the hash table:

  • If we find a matching GroupKey, just update its RunningVal appropriately.
  • Otherwise, insert a new (GroupKey, RunningVal) pair.
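For example, AVG can be maintained as a (count, sum) RunningVal and finalized at the end, a sketch with hypothetical names:

```python
def hash_aggregate_avg(rows):
    """GROUP BY key computing AVG(val).

    RunningVal is a (count, sum) pair, so the average can be finalized
    at the end without rescanning the input.
    """
    running = {}                              # GroupKey -> (count, sum)
    for key, val in rows:
        cnt, s = running.get(key, (0, 0))
        running[key] = (cnt + 1, s + val)     # update RunningVal on each match
    # Finalize: AVG = sum / count per group.
    return {k: s / cnt for k, (cnt, s) in running.items()}
```

The same shape works for SUM, COUNT, MIN, and MAX; only the RunningVal representation and the update rule change.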
