
Chapter 11 Join Algorithms

Note

Since this part is especially difficult in CMU 15-445, I choose to use the UCB CS186 material instead.

This part is adapted from the UCB CS186 notes.

We omit the numerous process diagrams that are essential for deeply understanding the course; you can learn the details from here

Introduction

Let’s begin with the simplest question: what, exactly, is a join?

If you remember the SQL project, you'll remember writing things like R INNER JOIN S ON R.name = S.name and other similar statements. What that actually meant is that you take two relations, \(R\) and \(S\), and create one new relation out of their matches on the join condition; that is, for each record \(r_i\) in \(R\), find all records \(s_j\) in \(S\) that match the join condition we have specified and write \(<r_i, s_j>\) as a new row in the output (all the fields of \(r_i\) followed by all the fields of \(s_j\)).

The SQL lecture slides are a great resource for more clarification on what joins actually are. Before we get into the different join algorithms, we need to discuss what happens when the new joined relation consisting of \(<r_i, s_j>\) records is formed. Whenever we compute the cost of a join, we will ignore the cost of writing the joined relation to disk. This is because we assume that the output of the join will be consumed by another operator later in the execution of the SQL query. Oftentimes this operator can consume the joined records directly from memory, so we don't need to write the joined table to disk.

Don’t worry if this sounds confusing right now; we will revisit it in the Query Optimization module, but the important thing to remember for now is that the final write cost is not included in our join cost models!

Comparison of Loop Joins

  • Simple Nested Loop Join
  • Page Nested Loop Join
  • Block Nested Loop Join

Simple Nested Loop Join

(SNLJ, just For-Loops)

Let’s start with the simplest strategy possible. Let’s say we have a buffer of B pages, and we wish to join two tables, \(R\) and \(S\), on the join condition \(\Theta\) (Theta). Starting with the most naïve strategy, we can take each record in \(R\), search for all its matches in \(S\), and then we yield each match. This is called simple nested loop join (SNLJ). You can think of it as two nested for loops:

```python
# pseudocode
for each record r_i in R:
    for each record s_j in S:
        if Theta(r_i, s_j):
            yield <r_i, s_j>
```

Caution⚠️

Be careful: we are counting I/Os here, not time complexity.

This would be a great thing to do, but the theme of the class is really centered around optimization and minimizing I/Os. For that, this is a pretty poor scheme, because we take each record in \(R\) and read in every single page in \(S\) searching for a match.

The I/O cost of this would then be \([R] + |R|[S]\), where \([R]\) is the number of pages in \(R\) and \(|R|\) is the number of records in \(R\).

And while we might be able to optimize things slightly by switching the order of \(R\) and \(S\) in the for loop, this really isn't a very good strategy. (Compare CSAPP's discussion of loop orderings and locality.)

\([R]\) rather than \(|R|\)

SNLJ does not incur \(|R|\) I/Os to read every record in \(R\).

It costs \([R]\) I/Os, because it's really doing something more like "for each page \(p_r\) in \(R\): for each record \(r\) in \(p_r\): for each page \(p_s\) in \(S\): for each record \(s\) in \(p_s\): join", since we can't read less than a page at a time.

This is also why the cost is \([R] + |R|[S]\) rather than \([R] + |R| \times |S|\).

Question about the formula above

Why is the cost \([R] + |R|[S]\) rather than \([R] + |R| \times |S|\)?

Because the basic unit of I/O is a page, not a record: each scan of \(S\) reads whole pages, not individual records.
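
To make the page-vs-record arithmetic concrete, here is a toy calculation in Python (all sizes are made up, not from the course):

```python
# hypothetical sizes: [R] = 1000 pages, |R| = 100,000 records, [S] = 500 pages
R_pages, R_records, S_pages = 1000, 100_000, 500

snlj_cost = R_pages + R_records * S_pages  # [R] + |R|[S]
print(snlj_cost)  # 50,001,000 I/Os
```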


Page Nested Loop Join

It’s clear that we don’t want to read in every single page of \(S\) for each record of \(R\), so what can we do better? What if we read in every single page in \(S\) for every single page of \(R\) instead? That is, for a page of \(R\), take all the records and match them against each record in \(S\), and do this for every page of \(R\).

That’s called page nested loop join (PNLJ). Here’s the pseudocode for it:

```python
# pseudocode
for each page p_r in R:
    for each page p_s in S:
        for each record r_i in p_r:
            for each record s_j in p_s:
                if Theta(r_i, s_j):
                    yield <r_i, s_j>
```

The I/O cost of this is somewhat better. It’s \([R] + [R][S]\) , where \([R]\) is the number of pages in \(R\) and \([S]\) is the number of pages in \(S\).

This can be optimized by keeping the smaller relation between \(R\) and \(S\) as the outer one (keep this in mind when asked to find the lowest cost for performing the join).

Caution⚠️

We are counting I/Os here, not time complexity.

That is why the cost is not \(|R| + |R| \times |S|\).
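
A quick numeric check of why the choice of outer relation matters, reusing the made-up sizes from above:

```python
R_pages, S_pages = 1000, 500  # made-up sizes

cost_R_outer = R_pages + R_pages * S_pages  # [R] + [R][S] = 501,000
cost_S_outer = S_pages + S_pages * R_pages  # [S] + [S][R] = 500,500
# the cross term is symmetric, so only the first term differs:
# putting the smaller relation on the outside is (slightly) cheaper
```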


Block Nested Loop Join

Page Nested Loop Join is a lot better! The only problem is that we’re still not fully utilizing our buffer as powerfully as we can. We have \(B\) buffer pages, but our algorithm only uses 3 – one for \(R\), one for \(S\), and one for the output buffer.

Remember that the fewer times we read in \(S\), the better – so if we can reserve \(B-2\) pages for \(R\) instead and match \(S\) against every record in each "chunk", we could cut down our I/O cost drastically!

In this join, \(B-2\) pages are for R, 1 page is for S, and 1 page is the output buffer for the join.

This is called Chunk Nested Loop Join (or Block Nested Loop Join). The key idea here is that we want to utilize our buffer to help us reduce the I/O cost, and so we can reserve as many pages as possible for a chunk of \(R\) – because we only read in each page of \(S\) once per chunk, larger chunks imply fewer I/Os. For each chunk of R, match all the records in \(S\) against all the records in the chunk.

```python
# pseudocode
for each chunk of B-2 pages B_r in R:
    for each page p_s in S:
        for each record r_i in B_r:
            for each record s_j in p_s:
                if Theta(r_i, s_j):
                    yield <r_i, s_j>
```

Then, the I/O cost of this can be written as \([R] + \left\lceil \frac{[R]}{B-2} \right\rceil [S]\) (the ceiling accounts for the last, possibly partial, chunk). This is a lot better! Now we're taking advantage of our \(B\) buffer pages to reduce the number of times we have to read in \(S\).
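
Plugging in the same hypothetical sizes shows how much the buffer helps (note the ceiling: the last chunk may be partially full):

```python
from math import ceil

B, R_pages, S_pages = 102, 1000, 500  # made-up buffer and table sizes

bnlj_cost = R_pages + ceil(R_pages / (B - 2)) * S_pages
print(bnlj_cost)  # 1000 + 10 * 500 = 6,000 I/Os, versus 501,000 for PNLJ
```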


Visual Comparison of Loop Joins

(Diagram omitted; see the original CS186 notes.)

Index Nested Loop Join

There are times, however, when Block Nested Loop Join isn’t the best thing to do. Sometimes, if we have an index on \(S\) that is on the appropriate field (i.e. the field we are joining on), it can be very fast to look up matches of \(r_i\) in S. This is called index nested loop join, and the pseudocode goes like this:

```python
# pseudocode
for each record r_i in R:
    for each record s_j in S where Theta(r_i, s_j) == True:
        yield <r_i, s_j>
```

The I/O cost is \([R] + |R| \times (\text{cost to look up matching records in } S)\).

The cost to look up matching records in \(S\) will differ based on the type of index (Alternative 1, 2, or 3; clustered vs. unclustered). If it is a B+ tree, we search starting at the root and count how many I/Os it takes to reach the matching records. See the Clustering and Counting I/O's sections of the B+ tree course notes.
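
As a purely illustrative calculation (the tree height, lookup cost, and table sizes below are all assumptions, not values from the course):

```python
R_pages, R_records = 1000, 100_000  # made-up sizes
lookup_cost = 3                     # assumed: root-to-leaf traversal plus one page of matches

inlj_cost = R_pages + R_records * lookup_cost
print(inlj_cost)  # 301,000 I/Os here; INLJ shines when |R| is small or lookups are cheap
```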

Hash Join

Notice that in this entire sequence, we're really just looking for matching records. Hash tables are really nice for looking up matches, though; even if we don't have an index, we can construct a hash table that is at most \(B-2\) pages big on the records of \(R\), fit it into memory, and then read in each record of \(S\) and probe \(R\)'s hash table for matches. This is called Naive Hash Join. Its cost is \([R] + [S]\) I/Os. That's actually the best one we've seen yet. It's efficient, cheap, and simple.
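
Here is a minimal in-memory sketch of the idea (the function names and record-level API are illustrative; the real operator works on buffered pages and requires the build side to fit in memory):

```python
from collections import defaultdict

def naive_hash_join(R, S, key_r, key_s):
    """Build a hash table on R (the build side), then stream S past it."""
    table = defaultdict(list)
    for r in R:                      # build phase: R must fit in memory
        table[key_r(r)].append(r)
    for s in S:                      # probe phase: one pass over S
        for r in table.get(key_s(s), []):
            yield (r, s)

# example usage
R = [(1, "a"), (2, "b")]
S = [(2, "x"), (3, "y")]
print(list(naive_hash_join(R, S, key_r=lambda r: r[0], key_s=lambda s: s[0])))
# [((2, 'b'), (2, 'x'))]
```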

There’s a problem with this, however; this relies on \(R\) being able to fit entirely into memory (specifically, having \(R\) being ≤ \(B-2\) pages big). And that’s often just not going to be possible.

Why must \(R\) be ≤ \(B-2\) pages big?

You can take a look at "Block Nested Loop Join" above.

To fix this, we repeatedly hash \(R\) and \(S\) into \(B-1\) partitions until we get partitions that are ≤ \(B-2\) pages big, enabling us to fit them into memory and perform a Naive Hash Join.

More specifically, consider each pair of corresponding partitions \(R_i\) and \(S_i\) (i.e., partition \(i\) of \(R\) and partition \(i\) of \(S\)). If \(R_i\) and \(S_i\) are both > \(B-2\) pages big, hash both partitions into smaller ones. Otherwise, if either \(R_i\) or \(S_i\) is ≤ \(B-2\) pages, stop partitioning and load the smaller partition into memory to build an in-memory hash table, then perform a Naive Hash Join with the larger partition in the pair.

This procedure is called Grace Hash Join, and the I/O cost of this is: the cost of hashing plus the cost of Naive Hash Join on the subsections.

\(\text{Cost}_{\text{GraceHashJoin}} = \text{Cost}_{\text{hash}} + \text{Cost}_{\text{NaiveHashJoin}}\)

Procedure
  1. Hash \(R\) and \(S\) into \(B-1\) partitions each (one input buffer, \(B-1\) output buffers).
    • For each pair of corresponding partitions \(R_i\) and \(S_i\), if both are > \(B-2\) pages big, hash both partitions into smaller ones.
    • Otherwise, if either \(R_i\) or \(S_i\) is ≤ \(B-2\) pages, stop partitioning and load the smaller partition into memory to build an in-memory hash table, then perform a Naive Hash Join with the larger partition in the pair.
  2. Pseudocode:
```python
# pseudocode (all sizes measured in pages)
def grace_hash_join(r, s):
    if size(r) > B - 2 and size(s) > B - 2:
        # both partitions are still too big for memory: partition each again
        partition(r)
        partition(s)
    else:
        smaller, larger = min(r, s, key=size), max(r, s, key=size)
        build_in_memory_hash_table(smaller)  # the smaller side fits in B-2 pages
        naive_hash_join(smaller, larger)
```

The cost of hashing depends on how many passes of partitioning we need and how many partitions each pass produces.

The cost of hashing a partition \(P\) includes the I/Os needed to read all the pages in \(P\) and the I/Os needed to write out all the resulting partitions after hashing \(P\).
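
Combining the partitioning cost with the probe phase described below, and assuming a single partitioning pass, no skew, and no partially-filled output pages, the total works out to roughly \(3([R] + [S])\) I/Os; a toy calculation with made-up sizes:

```python
R_pages, S_pages = 1000, 500  # made-up sizes

partition_pass = 2 * (R_pages + S_pages)  # read everything once, write everything once
probe_pass = R_pages + S_pages            # naive hash join over each partition pair
grace_cost = partition_pass + probe_pass
print(grace_cost)  # 3 * 1500 = 4,500 I/Os
```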

The Naive Hash Join portion costs, per partition pair, the I/Os needed to read every page of both partitions once partitioning is finished.

Grace Hash is great, but it's really sensitive to key skew, so you want to be careful when using this algorithm.

Key Skew

Key skew is when we try to hash but many of the keys go into the same bucket.

Key skew happens when many of the records share the same key. For example, if we're hashing on a column whose only value is "yes", then no matter which hash function we use, all of the records will keep ending up in the same bucket.
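
A tiny demonstration of why repartitioning cannot help in that case (the setup is contrived on purpose):

```python
records = [("yes", i) for i in range(1000)]  # every record has the same key

buckets = [[] for _ in range(4)]
for r in records:
    buckets[hash(r[0]) % 4].append(r)        # hash on the join key only

print([len(b) for b in buckets])  # one bucket holds all 1000 records
# rehashing with any other function on the same key behaves the same way,
# so this partition never shrinks below B-2 pages
```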

Sort-Merge Join

There’s also times when it helps for us to sort R and S first, especially if we want our joined table to be sorted on some specific column. In those cases, what we do is first sort R and S. Then:

(1) We begin at the start of \(R\) and \(S\) and advance one or the other until we get to a match (if \(R_i < S_j\), advance \(R\); else if \(R_i > S_j\), advance \(S\) – the idea is to advance the lesser of the two until we get to a match).

```python
# pseudocode
if R_i < S_j:
    advance(R)  # until R_i == S_j
else:
    advance(S)  # until R_i == S_j
```

(2) Now, let’s assume we’ve gotten to a match. Let’s say this pair is \((R_i, S_j)\). We mark this spot in \(S\) as \(S_j\) and check each subsequent record in \(S\) (\(S_{j}, S_{j+1}, S_{j+2}\), etc.) until we find something that is not a match (i.e. read in all records in \(S\) that match \(R_i\)).

```python
# pseudocode
if R_i == S_j:
    while R_i == S_j:
        advance(S)
```

(3) Now, advance to the next record in \(R\), go back to the marked spot in \(S\), and begin again at step 1 (except starting at these indices rather than at the start of \(R\) and \(S\)). The idea is that because \(R\) and \(S\) are sorted, no match for any future record of \(R\) can come before the marked spot in \(S\): if a match for \(R_i\) did not exist before \(S_j\), a match for a later record \(R_k\) cannot possibly be before \(S_j\) either! So we scan from the marked spot in \(S\) until we find a match for \(R_k\).

```text
advance(R)
go_to_marked_spot(S)

if R_i < S_j:
    advance(R)  # until R_i == S_j
else:
    advance(S)  # until R_i == S_j
```

This is called Sort-Merge Join.

The average I/O cost is \(\text{(cost to sort } R) + \text{(cost to sort } S) + ([R] + [S])\) (though it is important to note that this is not the worst case!). In the worst case, if every record of \(R\) matches every record of \(S\), the last term becomes \(|R| \times [S]\), so the worst-case cost is \(\text{(cost to sort } R) + \text{(cost to sort } S) + (|R| \times [S])\). That generally doesn't happen, though.
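
As a toy average-case calculation (made-up sizes; we assume each table sorts in two passes, where each pass reads and writes every page once):

```python
R_pages, S_pages = 1000, 500  # made-up sizes

sort_R = 2 * 2 * R_pages      # assumed: 2 passes, each reads and writes [R]
sort_S = 2 * 2 * S_pages
smj_avg = sort_R + sort_S + (R_pages + S_pages)
print(smj_avg)  # 4000 + 2000 + 1500 = 7,500 I/Os
```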

```c
// pseudocode: the full matching loop with mark/reset logic
do {
    if (!mark) {
        // advance the lagging pointer until the keys line up
        while (r < s) { advance r }
        while (r > s) { advance s }
        // mark the start of the "block" of matching S records
        mark = s
    }
    if (r == s) {
        result = <r, s>
        advance s
        yield result
    }
    else {
        // no more S matches for this r: rewind S and move on in R
        reset s to mark
        advance r
        mark = NULL
    }
}
```
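
For reference, here is a runnable in-memory sketch of the same mark/reset logic (illustrative only; the real operator streams sorted pages from disk, and the sample data below is made up to echo the sid values in the walkthrough):

```python
def sort_merge_join(R, S, key):
    """In-memory sketch of sort-merge join (equality join on key)."""
    R = sorted(R, key=key)
    S = sorted(S, key=key)
    out, i, j, mark = [], 0, 0, None
    while i < len(R) and j < len(S):
        if mark is None:
            # advance the lagging side until the keys line up
            while i < len(R) and key(R[i]) < key(S[j]):
                i += 1
            if i == len(R):
                break
            while j < len(S) and key(S[j]) < key(R[i]):
                j += 1
            if j == len(S):
                break
            mark = j  # start of the block of matching S records
        if j < len(S) and key(R[i]) == key(S[j]):
            out.append((R[i], S[j]))
            j += 1
        else:
            # exhausted S matches for R[i]: rewind S, advance R
            j, i, mark = mark, i + 1, None
    return out

R = [(28, "yu"), (31, "jim"), (31, "ann")]
S = [(28, "x"), (31, "a"), (31, "b")]
print(sort_merge_join(R, S, key=lambda t: t[0]))  # 1 match for 28, 4 for 31
```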

Let’s take a look at an example. Let the table on the left be \(R\) and the table on the right be \(S\).


We will advance the pointer (the red arrow) on \(S\) because 28 < 31 until \(S\) gets to sid of 31. Then we will mark this record (the black arrow). In addition, we will output this match (31==31).


Then we will advance the pointer on \(S\) again and we get another match and output it.


We advance the pointer on \(S\) again, but we do not get a match. We then reset \(S\) to where we marked (the black arrow) and then advance \(R\). When we advance \(R\), we get another match so we output it.


We then advance \(S\); we get another match, so we output it.


An Important Refinement

You can learn more details here.