Chapter 7 Hash Tables¶

We are now going to talk about how to support the DBMS's execution engine to read/write data from pages.

Two types of data structures:

Hash Tables
Trees

HASH TABLES¶

A hash table implements an unordered associative array that maps keys to values.

It uses a hash function to compute an offset into this array for a given key, from which the desired value can be found.

Space Complexity: O(n)
Time Complexity:
- Average: O(1)
- Worst: O(n)

Database need to care about constants

STATIC HASH TABLE¶

Assumption #1: Number of elements is known ahead of time and fixed.
Assumption #2: Each key is unique.
Assumption #3: Perfect hash function.
- If key1 ≠ key2, then hash(key1) ≠ hash(key2)

We do not want to use a cryptographic hash function for DBMS hash tables (e.g., SHA-2).

We want something that is fast and has a low collision rate.

STATIC HASHING SCHEMES¶

LINEAR PROBE HASHING¶

Single giant table of slots.
Resolve collisions by linearly searching for the next free slot in the table.
- To determine whether an element is present, hash to a location in the index and scan for it.
- Must store the key in the index to know when to stop scanning.
- Insertions and deletions are generalizations of lookups.

INSERT¶

alt text

DELETE¶

We use "Tombstone"

Set a marker to indicate that the entry in the slot is logically deleted.
You can reuse the slot for new keys.
May need periodic garbage collection.

alt text

NON-UNIQUE KEYS¶

alt text

ROBIN HOOD HASHING¶

Variant of linear probe hashing that steals slots from "rich" keys and give them to "poor" keys.

Each key tracks the number of positions they are from where its optimal position in the table.
On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.

Note

A beats B if and only if A[Num] > B[Num]

alt text

CUCKOO HASHING¶

Use multiple hash tables with different hash function seeds.

On insert, check every table and pick anyone that has a free slot.
If no table has a free slot, evict the element from one of them and then re-hash it find a new location.

alt text

Thinking

The previous hash tables require the DBMS to know the number of elements it wants to store.

Dynamic hash tables resize themselves on demand.

Chained Hashing
Extendible Hashing
Linear Hashing

DYNAMIC HASHING SCHEMES¶

CHAINED HASHING¶

Maintain a linked list of buckets for each slot in the hash table.

Resolve collisions by placing all elements with the same hash key into the same bucket.

To determine whether an element is present, hash to its bucket and scan for it.
Insertions and deletions are generalizations of lookups.

alt text

EXTENDIBLE HASHING¶

Chained-hashing approach where we split buckets instead of letting the linked list grow forever.

Multiple slot locations can point to the same bucket chain.

Reshuffle bucket entries on split and increase the number of bits to examine. - Data movement is localized to just the split chain.

alt text

LINEAR HASHING¶

alt text

When any bucket overflows, split the bucket at the pointer location by adding a new slot entry, and create a new hash function.
If the hash function maps to slot that has previously been pointed to by pointer, apply the new hash function.
When the pointer reaches last slot, delete original hash function and replace it with a new hash function.