
Chapter 6 Memory Management

How the DBMS manages its memory and moves data back and forth between memory and disk.

Outline
  • Buffer Pool Manager
  • Replacement Policies
  • Other Memory Pools

DATABASE STORAGE

Spatial Control:

  • Where to write pages on disk.
  • The goal is to keep pages that are used together often as physically close together as possible on disk.

Temporal Control:

  • When to read pages into memory, and when to write them to disk.
  • The goal is to minimize the number of stalls from having to read data from disk.


Interestingly, the same page may end up in a different buffer pool frame each time it is loaded, since the hash table does not preserve any ordering.


BUFFER POOL ORGANIZATION

  1. Memory region organized as an array of fixed-size pages.
  2. An array entry is called a frame.
  3. When the DBMS requests a page, an exact copy is placed into one of these frames.
  4. Dirty pages are buffered and not written to disk immediately.
    • Write-Back Cache
Dirty Pages

In operating systems and database management systems, "dirty pages" are memory pages that have been modified but not yet written to disk. Specifically:

  • Memory Pages: the basic unit the OS uses to manage memory, usually a fixed-size block (e.g., 4KB).
  • Dirty Pages: when a memory page's contents are modified (e.g., by an update or insert) but the changes have not yet been written back to persistent storage on disk, that page is called a "dirty page".

Dirty pages are an important concept because they hold the latest data changes, which have not yet been persisted. If the system crashes or loses power, those changes may be lost. The OS or DBMS therefore usually has a mechanism that periodically writes dirty pages back to disk to ensure data consistency and durability.

Saying that "dirty pages are buffered and not written to disk immediately" means modified pages are temporarily kept in memory (buffered) rather than written out right away. This improves performance: writing to disk is relatively slow, so delaying writes makes better use of the cache and reduces the frequency of disk writes.



BUFFER POOL META-DATA

The page table keeps track of pages that are currently in memory.

Also maintains additional meta-data per page:

  • Dirty Flag
  • Pin/Reference Counter
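A minimal sketch of the per-frame meta-data and the page table, in Python with hypothetical names (a real buffer pool would of course not use Python objects for frames):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Frame:
    page_id: Optional[int] = None   # which page currently occupies this frame
    data: bytes = b""               # exact copy of the on-disk page
    is_dirty: bool = False          # set when the in-memory copy is modified
    pin_count: int = 0              # > 0 means the page must not be evicted

class BufferPool:
    def __init__(self, num_frames: int):
        self.frames = [Frame() for _ in range(num_frames)]
        self.page_table = {}        # page_id -> frame index (in-memory only)

    def pin(self, page_id: int) -> Frame:
        """Look up a resident page and pin it so it cannot be evicted."""
        frame = self.frames[self.page_table[page_id]]
        frame.pin_count += 1
        return frame

    def unpin(self, page_id: int, dirtied: bool) -> None:
        frame = self.frames[self.page_table[page_id]]
        frame.pin_count -= 1
        frame.is_dirty = frame.is_dirty or dirtied
```

Pinning prevents eviction while a thread is using the page; unpinning with `dirtied=True` marks the frame for a later write-back.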



We explain the distinction between locks and latches below.

LOCKS VS. LATCHES

Locks:

  • Protects the database's logical contents from other transactions.
  • Held for transaction duration. (long-time)
  • Need to be able to rollback changes. (rollback ✅)

Latches:

Essentially a mutex (mutual exclusion lock).

  • Protects the critical sections of the DBMS's internal data structure from other threads.
  • Held for operation duration. (short-time)
  • Do not need to be able to rollback changes. (rollback ❌)

Note

Locks

  • Purpose: manage concurrent transactions' access to shared resources (such as tables or rows) to ensure data consistency and integrity.
  • Granularity: usually coarse (a database, a table, a row, or even a finer-grained column).
  • Duration: held for a long time, usually until the transaction ends (commit or rollback).
  • Overhead: relatively high, since locks must be maintained for a long time.
  • Use case: mainly in database systems, to keep data consistent when transactions execute concurrently.

Latches

  • Purpose: protect internal data structures from concurrent modification for short periods, keeping the structures consistent.
  • Granularity: usually fine (a specific in-memory data structure, such as a page in the buffer pool).
  • Duration: very short, typically held only across the critical section of an operation.
  • Overhead: low, since latches are held only briefly.
  • Use case: mainly inside the DBMS, to protect in-memory structures such as the buffer pool and log buffers.

Example

Lock: while a transaction is reading or writing a row in the database, the DBMS uses a lock to ensure that no other transaction can modify that row until the current transaction completes.

Latch: when the system needs to access a page in the buffer pool, it uses a latch to ensure that no other operation interferes while that page is being read or written.
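To make the latch case concrete, here is a toy Python sketch (not real DBMS code): a latch is just a mutex held only for the duration of one short operation on an in-memory structure, such as a page in the buffer pool.

```python
import threading

class LatchedPage:
    """A buffer-pool page protected by a latch (mutex) for short operations."""
    def __init__(self, data: bytes):
        self.latch = threading.Lock()
        self.data = data

    def read(self) -> bytes:
        # The latch is held only for the duration of this one operation,
        # then released immediately -- no rollback machinery is needed.
        with self.latch:
            return self.data

    def write(self, data: bytes) -> None:
        with self.latch:
            self.data = data
```

Contrast this with a lock, which would be held across the whole transaction and would need to support rollback.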

PAGE TABLE VS. PAGE DIRECTORY

The page directory is the mapping from page ids to page locations in the database files. All changes must be recorded on disk to allow the DBMS to find them on restart.


The page table is the mapping from page ids to a copy of the page in buffer pool frames. This is an in-memory data structure that does not need to be stored on disk.



ALLOCATION POLICIES

Global Policies:

  • Make decisions for all active queries.

Local Policies:

  • Allocate frames to a specific query without considering the behavior of concurrent queries.
  • Still need to support sharing pages.

BUFFER POOL OPTIMIZATIONS

  • Multiple Buffer Pools
  • Pre-Fetching
  • Scan Sharing
  • Buffer Pool Bypass

MULTIPLE BUFFER POOLS

The DBMS does not always have a single buffer pool for the entire system.

  • Multiple buffer pool instances
  • Per-database buffer pool
  • Per-page type buffer pool

Partitioning memory across multiple pools helps reduce latch contention and improve locality.


Note

Reducing latch contention

  1. Less contention

    • When multiple threads or processes need to access the same resources, latches are usually required to protect them. After partitioning memory into multiple pools, different threads or processes can access different pools, which reduces contention between them.
    • Since each pool has its own latching, threads or processes can access their respective pools independently. This significantly reduces latch contention and improves concurrency.
  2. More parallelism

    • Multiple pools allow more parallel operations, since each pool can be accessed by different threads or processes at the same time. This takes full advantage of multi-core processors and raises the system's overall throughput.

Improving locality

  1. Cache locality

    • When memory is partitioned into multiple pools, the data within each pool is usually more closely related. This improves cache hit rates: accesses to the same pool tend to fall within a smaller memory range, which makes better use of the CPU cache.
    • For example, if a thread frequently accesses data in one pool, that data is more likely to remain in the cache, reducing memory-access latency.
  2. Data locality

    • Data within a pool usually has better spatial and temporal locality. Spatial locality means the memory addresses a program accesses tend to cluster in a small region over a period of time; temporal locality means recently accessed data is likely to be accessed again soon.
    • After partitioning into multiple pools, the data within each pool is more likely to retain this locality, reducing memory-access latency and improving performance.

FORMAT

SQL

-- DB2
CREATE BUFFERPOOL custom_pool
    SIZE 250 PAGESIZE 8k;
CREATE TABLESPACE custom_tablespace
    PAGESIZE 8k BUFFERPOOL custom_pool;
CREATE TABLE new_table
    TABLESPACE custom_tablespace (......);

HOW TO USE

Approach #1: Object Id <ObjId, PageId, SlotNum>

Embed an object identifier in record ids and then maintain a mapping from objects to specific buffer pools.

Approach #2: Hashing HASH(123) % n

Hash the page id to select which buffer pool to access.
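A minimal sketch of the hashing approach (hypothetical names; real buffer pools are not Python dicts). The point is that the same page id always maps to the same pool, so no extra mapping table is needed:

```python
# Sketch: route a page id to one of n buffer-pool instances by hashing.
NUM_POOLS = 4
buffer_pools = [dict() for _ in range(NUM_POOLS)]  # stand-ins for real pools

def pool_for(page_id: int) -> dict:
    # HASH(page_id) % n picks the pool deterministically.
    return buffer_pools[hash(page_id) % NUM_POOLS]

pool_for(123)["page"] = "contents"   # page 123 always lands in the same pool
```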



PRE-FETCHING

The DBMS can also prefetch pages based on a query plan.

  • Sequential Scans
  • Index Scans




  1. Normally we use a sequential scan, so we scan page 0 and then page 1.
  2. While scanning page 1, we want to pre-fetch, so we choose to load pages 2 and 3.
  3. Since only one empty frame is left, we drop page 0 to make space.
  4. We load pages 2 and 3.
  5. Pages 2 and 3 are then served straight from the buffer pool, and the scan moves on to page 4.
  6. We scan pages 4 and 5, replacing pages 1 and 2 respectively as usual.
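The prefetching idea in these steps can be sketched as follows. This is a toy model with a hypothetical `fetch` callback and a lookahead window of 2; frame eviction is omitted for brevity:

```python
def scan_with_prefetch(pages, fetch, window=2):
    """Sequential scan that, after reading page i, prefetches the next
    `window` pages so they are already resident when the scan reaches them."""
    fetched = set()
    for i, page in enumerate(pages):
        if page not in fetched:
            fetch(page)             # blocking read: the scan stalls here
            fetched.add(page)
        for ahead in pages[i + 1 : i + 1 + window]:
            if ahead not in fetched:
                fetch(ahead)        # prefetch: overlaps with processing
                fetched.add(ahead)

reads = []
scan_with_prefetch([0, 1, 2, 3, 4, 5], fetch=reads.append)
# every page is fetched exactly once; pages 1..5 arrive via prefetch
```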

PRE-FETCHING with B+ tree


SCAN SHARING

Queries can reuse data retrieved from storage or operator computations.

  • Also called synchronized scans.
  • This is different from result caching.

Allow multiple queries to attach to a single cursor that scans a table.

  • Queries do not have to be the same.
  • Can also share intermediate results.

Note

  1. Synchronized scans

    • This technique allows multiple queries to reuse data that has already been retrieved from storage or computed by operators.
    • It differs from result caching, which stores a query's final result for reuse; synchronized scans share data during execution.
  2. Multiple queries share one cursor

    • Multiple queries can attach to a single cursor that scans a table.
    • The queries do not need to be identical; they can be different queries that happen to need overlapping data.
    • Intermediate results of the scan can also be shared between queries, improving efficiency by avoiding redundant retrieval and computation.

If a query wants to scan a table and another query is already doing this, then the DBMS will attach the second query's cursor to the existing cursor.
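A toy sketch of such a shared cursor (hypothetical names; a real implementation must also handle latching and detect when each attached query has seen every page). A late arrival attaches mid-scan and wraps around to pick up the pages it missed:

```python
class SharedCursor:
    """Toy synchronized scan: late arrivals attach mid-scan, then wrap
    around the table to pick up the pages they missed."""
    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.pos = 0                       # next page the cursor will read

    def attach(self):
        return self.pos                    # remember where this query joined

    def step(self):
        page = self.pos % self.num_pages
        self.pos += 1
        return page

cursor = SharedCursor(num_pages=4)
q1_pages = [cursor.step(), cursor.step()]       # query 1 reads pages 0, 1
q2_start = cursor.attach()                      # query 2 attaches at page 2
q2_pages = [cursor.step() for _ in range(4)]    # reads 2, 3, then wraps: 0, 1
```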

BUFFER POOL BYPASS

The sequential scan operator will not store fetched pages in the buffer pool to avoid overhead.

  • Memory is local to running query.
  • Works well if operator needs to read a large sequence of pages that are contiguous on disk.
  • Can also be used for temporary data (sorting, joins).

Called "Light Scans" in Informix.

OS PAGE CACHE

Most disk operations go through the OS API. Unless the DBMS tells it not to, the OS maintains its own filesystem cache (aka page cache, buffer cache).

aka = also known as

Most DBMSs use direct I/O (O_DIRECT) to bypass the OS's cache, which avoids:

  • Redundant copies of pages.
  • Different eviction policies.
  • Loss of control over file I/O.


BUFFER REPLACEMENT POLICIES

When the DBMS needs to free up a frame to make room for a new page, it must decide which page to evict from the buffer pool.

Goals
  • Correctness
  • Accuracy
  • Speed
  • Meta-data Overhead

LRU

Maintain a single timestamp of when each page was last accessed.

When the DBMS needs to evict a page, select the one with the oldest timestamp.

PS: Keep the pages in sorted order to reduce the search time on eviction
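A minimal Python sketch of exact LRU, keeping pages in access order in an OrderedDict so that the front entry is always the least recently used and no search is needed on eviction:

```python
from collections import OrderedDict

class LRUPool:
    """Exact LRU: pages kept in access order, oldest first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()              # page_id -> data

    def access(self, page_id, load):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # refresh: most recently used
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict oldest timestamp
            self.pages[page_id] = load(page_id)
        return self.pages[page_id]

pool = LRUPool(capacity=2)
pool.access("A", load=str.lower)
pool.access("B", load=str.lower)
pool.access("A", load=str.lower)                # refresh A's timestamp
pool.access("C", load=str.lower)                # evicts B, not A
```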

CLOCK

The Clock algorithm is a more efficient approximation of the Least Recently Used (LRU) page replacement algorithm, which avoids the need for maintaining exact usage timestamps for each page.

Instead, it uses a single reference bit for each page and organizes the pages in a circular buffer, resembling the face of a clock. Here's a step-by-step explanation of how the Clock algorithm works:

Structure and Setup

  • Reference Bit: Each page has a reference bit associated with it.
  • Circular Buffer: Pages are arranged in a circular buffer. This structure allows us to easily iterate over the pages in a circular manner.
  • Clock Hand: A pointer, known as the "clock hand," indicates the current position in the circular buffer.

Algorithm Steps

  1. Page Access: When a page is accessed (read or written), its reference bit is set to 1. This indicates that the page has been recently used.
  2. Page Replacement (when a new page needs to be loaded):
    • The clock hand sweeps over the pages in the circular buffer.
    • For each page it encounters:
      • Reference Bit is 1: The page was recently used. The algorithm sets the reference bit to 0 and moves the clock hand to the next page.
      • Reference Bit is 0: The page has not been used recently, so it is considered for replacement.
  3. Eviction: When a page with a reference bit of 0 is found, the new page is loaded into this position, and the clock hand is advanced to the next position.
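These steps can be sketched in Python as follows (hypothetical names; a real buffer pool tracks frames and page ids, not page labels):

```python
class ClockPool:
    """Clock replacement: one reference bit per frame, frames in a
    circular buffer, a hand that sweeps clearing bits until it finds a 0."""
    def __init__(self, pages):
        self.pages = list(pages)
        self.ref = [0] * len(pages)
        self.hand = 0

    def access(self, page):
        self.ref[self.pages.index(page)] = 1   # mark as recently used

    def evict_and_load(self, new_page):
        while True:
            if self.ref[self.hand] == 1:
                self.ref[self.hand] = 0        # second chance, keep sweeping
                self.hand = (self.hand + 1) % len(self.pages)
            else:
                victim = self.pages[self.hand]
                self.pages[self.hand] = new_page
                self.ref[self.hand] = 1        # new page counts as referenced
                self.hand = (self.hand + 1) % len(self.pages)
                return victim

pool = ClockPool(["A", "B", "C", "D"])
pool.access("A"); pool.access("C")
victim = pool.evict_and_load("E")   # A gets a second chance; B is evicted
```

Running this reproduces the example walkthrough: A's bit is cleared, B is evicted, and the hand ends up pointing at C.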

Example Walkthrough

Let's go through an example to clarify:

  1. Initial State:
    • Pages: [A, B, C, D]
    • Reference Bits: [0, 0, 0, 0]
    • Clock Hand: Points to page A
  2. Page Access:
    • Access page A: Set A's reference bit to 1.
    • Access page C: Set C's reference bit to 1.
    • State:
      • Pages: [A, B, C, D]
      • Reference Bits: [1, 0, 1, 0]
      • Clock Hand: Points to page A
  3. Page Replacement Needed (e.g., loading page E):
    • Clock hand points to A:
      • A's reference bit is 1 → Set to 0 and move the clock hand to B.
    • Clock hand points to B:
      • B's reference bit is 0 → Replace B with E.
    • New State:
      • Pages: [A, E, C, D]
      • Reference Bits: [0, 1, 1, 0]
      • Clock Hand: Points to page C

Advantages

  • Efficiency: The Clock algorithm avoids the overhead of maintaining precise usage information, making it more efficient than exact LRU.
  • Simplicity: The algorithm is simple to implement with just a reference bit and a circular buffer.

In database systems, the Clock algorithm is often used for buffer management to decide which pages to evict from memory when new pages need to be loaded. Its efficiency and simplicity make it well-suited for handling large datasets and high transaction volumes.

SEQUENTIAL FLOODING

LRU and CLOCK replacement policies are susceptible to sequential flooding.

  • A query performs a sequential scan that reads every page.
  • This pollutes the buffer pool with pages that are read once and then never again.

In some workloads the most recently used page is the most unneeded page.

This is called "Buffer Pool pollution"

Example:

  • Initial State:
    • Assume a buffer pool can hold 4 pages and initially contains pages [A, B, C, D].
  • Sequential Scan:
    • The query reads page E, evicting A (if using LRU or a similar policy).
    • Then it reads page F, evicting B.
    • This continues until the buffer pool is filled with pages [E, F, G, H].
  • Result:
    • Pages [A, B, C, D] are replaced with [E, F, G, H], even though E, F, G, H are unlikely to be needed again soon.
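This effect is easy to reproduce with a toy LRU pool (a sketch with hypothetical names): one pass over four new pages flushes out every hot page.

```python
from collections import OrderedDict

def lru_scan(capacity, initial, scan):
    """Run a sequential scan through a tiny LRU pool and return its
    final contents."""
    pool = OrderedDict((p, None) for p in initial)
    for page in scan:
        if len(pool) >= capacity:
            pool.popitem(last=False)       # evict least recently used
        pool[page] = None
    return list(pool)

# The hot pages A-D are all evicted by one pass over E-H,
# even though E-H will likely never be read again.
final = lru_scan(4, ["A", "B", "C", "D"], ["E", "F", "G", "H"])
```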

BETTER POLICIES: LRU-K

The LRU-K algorithm is an advanced page replacement policy that improves upon the basic LRU algorithm by considering the history of page accesses. It tracks the last K references to each page to make more informed decisions about which pages to evict. Here’s a detailed explanation of how LRU-K works:

Key Concepts

  1. K References: The algorithm keeps track of the last K timestamps (or references) for each page. Common choices are K=2 or K=3.
  2. Intervals: It computes the intervals between subsequent accesses to estimate the likelihood of future accesses.

How It Works

  1. Tracking References:
    • For each page, maintain a history of the last K timestamps when the page was accessed.
    • For instance, if K=2, you keep the timestamps of the last two times the page was accessed.
  2. Estimating Access Patterns:
    • By examining the intervals between these K accesses, the algorithm estimates how frequently the page is accessed.
    • Pages with shorter intervals between accesses are considered more likely to be accessed again soon.
  3. Page Replacement Decision:
    • When a page needs to be evicted, the DBMS examines the intervals for all pages.
    • Pages with longer intervals (indicating they are accessed less frequently) are more likely to be evicted.
    • If a page has fewer than K references, it might be prioritized for eviction, as there isn’t enough data to reliably predict its access pattern.

Example Walkthrough

Consider an example with K=2:

Initial State

  • Suppose the buffer pool can hold 4 pages: [A, B, C, D].
  • Track the last 2 access times for each page.

Access Sequence

  1. Access Page A:
    • Timestamps for A: [T1]
  2. Access Page B:
    • Timestamps for B: [T2]
  3. Access Page C:
    • Timestamps for C: [T3]
  4. Access Page D:
    • Timestamps for D: [T4]
  5. Access Page A again:
    • Timestamps for A: [T1, T5]

Eviction Decision

  • Assume we need to load Page E:
    • Compare intervals:
      • Page A: Intervals [T5 - T1]
      • Page B: Not enough data, only [T2]
      • Page C: Not enough data, only [T3]
      • Page D: Not enough data, only [T4]
    • Evict: Page B, C, or D could be chosen since they have less historical data indicating frequent use.

Advantages of LRU-K

  1. Better Prediction: By using the history of the last K accesses, LRU-K can better predict which pages are likely to be needed soon.
  2. Flexibility: Adjusting K allows for tuning the algorithm to different access patterns. For instance, K=2 balances between recent and frequent usage, while higher K values provide even more historical context.
  3. Reduced Sequential Flooding Impact: Unlike simple LRU, LRU-K is less likely to be misled by sequential scans, as it requires multiple accesses to establish a page’s importance.

BETTER POLICIES: LOCALIZATION

The per transaction/query eviction strategy with a private ring buffer helps a DBMS like PostgreSQL manage its buffer pool more effectively.

In fact, eviction within the ring buffer works like round-robin :)

It ensures that sequential scans and similar operations do not excessively pollute the buffer pool with pages that are read once and not used again, maintaining overall system performance and efficiency.

This technique isolates the impact of each query's page accesses, ensuring that frequently accessed pages are prioritized in the main buffer pool.

Postgres maintains a small ring buffer that is private to the query.

Steps:

(1) Initialize Ring Buffer:

  • A small ring buffer of size 3 is allocated for the query.
  • Initially empty: [].

(2) Sequential Scan Accesses Pages:

  • Reads page P1: Ring buffer: [P1].
  • Reads page P2: Ring buffer: [P1, P2].
  • Reads page P3: Ring buffer: [P1, P2, P3].

(3) Ring Buffer Full:

  • Reads page P4: Ring buffer: [P4, P2, P3] (replaces P1).
  • Reads page P5: Ring buffer: [P4, P5, P3] (replaces P2).
  • Reads page P6: Ring buffer: [P4, P5, P6] (replaces P3).

(4) Eviction Decisions:

  • Only if the query accesses a page multiple times may it be promoted to the main buffer pool.
  • This ensures only frequently accessed pages during the query’s execution impact the main buffer pool.
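The private ring buffer in the steps above can be sketched as follows (hypothetical names; real PostgreSQL ring buffers hold buffer descriptors, not page labels):

```python
class RingBuffer:
    """Fixed-size private ring buffer for one query's sequential scan;
    eviction is round-robin over the slots."""
    def __init__(self, size):
        self.slots = [None] * size
        self.next = 0                      # next slot to overwrite

    def add(self, page):
        evicted = self.slots[self.next]
        self.slots[self.next] = page
        self.next = (self.next + 1) % len(self.slots)
        return evicted

ring = RingBuffer(3)
for p in ["P1", "P2", "P3", "P4", "P5", "P6"]:
    ring.add(p)
# P4 overwrote P1, P5 overwrote P2, P6 overwrote P3
```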

BETTER POLICIES: PRIORITY HINTS

The DBMS knows about the context of each page during query execution.

It can provide hints to the buffer pool on whether a page is important or not.


DIRTY PAGES

Fast Path: If a page in the buffer pool is not dirty, then the DBMS can simply "drop" it.

Slow Path: If a page is dirty, then the DBMS must write back to disk to ensure that its changes are persisted.

BACKGROUND WRITING

The DBMS can periodically walk through the page table and write dirty pages to disk.

When a dirty page is safely written (from memory to disk), the DBMS can either evict the page or just unset the dirty flag.

Need to be careful that the system doesn't write dirty pages before their log records are written…
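That last caveat is the write-ahead logging rule: a dirty page may be flushed only once the log records covering its changes are durable. A sketch of a background writer honoring this rule (`page_lsn` and `flushed_lsn` are assumptions borrowed from standard WAL terminology, not from these notes):

```python
def background_write(page_table, flushed_lsn, write_page):
    """Sweep the page table and flush dirty pages, but only those whose
    last change (page_lsn) is already covered by the durable portion of
    the log (flushed_lsn)."""
    for page in page_table:
        if page["dirty"] and page["page_lsn"] <= flushed_lsn:
            write_page(page)
            page["dirty"] = False          # safely persisted; unset the flag

pages = [
    {"id": 1, "dirty": True,  "page_lsn": 10},
    {"id": 2, "dirty": True,  "page_lsn": 99},  # its log records not yet flushed
    {"id": 3, "dirty": False, "page_lsn": 5},
]
written = []
background_write(pages, flushed_lsn=50,
                 write_page=lambda p: written.append(p["id"]))
# only page 1 is written; page 2 must wait for the log to catch up
```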