跳转至

Chapter 7 Hash Tables

We are now going to talk about how to support the DBMS's execution engine to read/write data from pages.

Two types of data structures:

  • Hash Tables
  • Trees

HASH TABLES

A hash table implements an unordered associative array that maps keys to values.

It uses a hash function to compute an offset into this array for a given key, from which the desired value can be found.

  • Space Complexity: O(n)
  • Time Complexity:
    • Average: O(1)
    • Worst: O(n)

Database need to care about constants

STATIC HASH TABLE

  • Assumption #1: Number of elements is known ahead of time and fixed.
  • Assumption #2: Each key is unique.
  • Assumption #3: Perfect hash function.
    • If key1 ≠ key2, then hash(key1) ≠ hash(key2)

We do not want to use a cryptographic hash function for DBMS hash tables (e.g., SHA-2).

We want something that is fast and has a low collision rate.

STATIC HASHING SCHEMES

LINEAR PROBE HASHING

  • Single giant table of slots.
  • Resolve collisions by linearly searching for the next free slot in the table.
    • To determine whether an element is present, hash to a location in the index and scan for it.
    • Must store the key in the index to know when to stop scanning.
    • Insertions and deletions are generalizations of lookups.

INSERT

alt text alt text alt text alt text alt text alt text

DELETE

We use "Tombstone"

  • Set a marker to indicate that the entry in the slot is logically deleted.
  • You can reuse the slot for new keys.
  • May need periodic garbage collection.

alt text alt text alt text alt text

NON-UNIQUE KEYS

alt text

ROBIN HOOD HASHING

Variant of linear probe hashing that steals slots from "rich" keys and give them to "poor" keys.

  • Each key tracks the number of positions they are from where its optimal position in the table.
  • On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.
Note

A beats B if and only if A[Num] > B[Num]

alt text alt text alt text alt text alt text alt text alt text alt text

CUCKOO HASHING

Use multiple hash tables with different hash function seeds.

  • On insert, check every table and pick anyone that has a free slot.
  • If no table has a free slot, evict the element from one of them and then re-hash it find a new location.

alt text alt text alt text alt text alt text alt text alt text alt text alt text alt text alt text

Thinking

The previous hash tables require the DBMS to know the number of elements it wants to store.

Dynamic hash tables resize themselves on demand.

  • Chained Hashing
  • Extendible Hashing
  • Linear Hashing

DYNAMIC HASHING SCHEMES

CHAINED HASHING

Maintain a linked list of buckets for each slot in the hash table.

Resolve collisions by placing all elements with the same hash key into the same bucket.

  • To determine whether an element is present, hash to its bucket and scan for it.
  • Insertions and deletions are generalizations of lookups.

alt text

alt text

alt text

alt text

alt text

alt text

EXTENDIBLE HASHING

Chained-hashing approach where we split buckets instead of letting the linked list grow forever.

Multiple slot locations can point to the same bucket chain.

Reshuffle bucket entries on split and increase the number of bits to examine. - Data movement is localized to just the split chain.

alt text alt text alt text alt text alt text alt text alt text alt text

LINEAR HASHING

alt text

  • When any bucket overflows, split the bucket at the pointer location by adding a new slot entry, and create a new hash function.
  • If the hash function maps to slot that has previously been pointed to by pointer, apply the new hash function.
  • When the pointer reaches last slot, delete original hash function and replace it with a new hash function.

INSERT

alt text alt text alt text alt text alt text

Since ori_map(20) is 0, which is above the Split-Pointer-Line, we use new_map(20) = 4.

alt text alt text

Since ori_map(9) is 1, which is below the Split-Pointer-Line, we use it.

alt text

DELETE

Now we stand in the position above, and we will explain the exact process in face of it.

alt text alt text alt text alt text alt text alt text alt text

CONCLUSION

Fast data structures that support O(1) look-ups that are used all throughout DBMS internals.

Trade-off between speed and flexibility.

Hash tables are usually for a table index…

We usually use B+ Tree instead, which is organized to be introduced next class.