Chapter 15 Concurrency Control Theory¶
A DBMS's concurrency control and recovery components permeate(贯穿) throughout the design of its entire architecture.
TRANSACTIONS¶
A transaction is the execution of a sequence of one or more operations (e.g., SQL queries) on a database to perform some higher-level function.
For example, "fetch $100 from A bank and deposit it into B bank" is a transaction.
It is the basic unit of change in a DBMS:
- Partial transactions are not allowed!
For example, just "fetch $100 from A bank" or "deposit it into B bank" is a partial transaction and it's forbidden.
STRAWMAN SYSTEM¶
Execute each txn one-by-one (i.e., serial order) as they arrive at the DBMS.
- One and only one txn can be running at the same time in the DBMS.
Before a txn starts, copy the entire database to a new file (tmp_file) and make all changes to that file (tmp_file).
- If the txn completes successfully, overwrite the original file with the new one (tmp_file replaces ori_file).
- If the txn fails, just remove the dirty copy (still ori_file).
A (potentially) better approach is to allow concurrent execution of independent transactions.
Arbitrary(任意的) interleaving(交错) of operations can lead to:
- Temporary Inconsistency (ok, unavoidable)
- Permanent Inconsistency (bad!)
We need formal correctness criteria to determine whether an interleaving is valid.
在并发操作中,操作的任意交错可能会导致临时不一致(这是不可避免的,可以接受),但如果不加以控制,可能会导致永久性不一致(这是非常糟糕的情况)。为了避免这种情况,我们需要制定正式的正确性标准来判断操作的交错是否有效。
DEFINITIONS¶
A txn may carry out many operations on the data retrieved from the database
The DBMS is only concerned about what data is read/written from/to the database.
- Changes to the "outside world" are beyond the scope of the DBMS.
FORMAL DEFINITIONS¶
Database: A fixed set of named data objects (e.g., A, B, C, ......)
- We do not need to define what these objects are now.
- We will discuss how to handle inserts/deletes next week.
Transaction: A sequence of read and write operations (R(A), W(B), ......)
- DBMS's abstract view of a user program
TRANSACTIONS IN SQL¶
- A new txn starts with the
BEGIN
command. - The txn stops with either
COMMIT
orABORT
:- If commit, the DBMS either saves all the txn's changes or aborts it.
- If abort, all changes are undone so that it's like as if the txn never executed at all.
Abort can be either self-inflicted(自身造成的) or caused by the DBMS.
CORRECTNESS CRITERIA: ACID¶
这些概念是数据库事务(transactions)的ACID特性,描述了事务处理的四个基本属性:原子性、一致性、隔离性和持久性。
这些特性共同保证了数据库在并发和故障情况下的数据完整性和可靠性。
I believe these concepts are of great significance in this chapter!
Atomicity¶
All actions in txn happen, or none happen.
"All or nothing…"
事务中的所有操作要么全部执行,要么全部不执行。不会出现只执行部分操作的情况。
这意味着事务中的操作要么全部成功,要么全部失败。如果在执行过程中出现错误,事务会回滚,恢复到事务开始前的状态。
Consistency¶
If each txn is consistent and the DB starts consistent, then it ends up consistent.
"It looks correct to me…"
这意味着事务在执行过程中保持数据的完整性和约束。如果每个事务都能维护数据库的一致性,那么整个数据库系统将始终处于一致状态。
Isolation¶
Execution of one txn is isolated from that of other txns.
"All by myself…"
这意味着一个事务的中间状态对其他事务是不可见的。事务之间是独立的,保证了并发事务的正确执行。
Durability¶
If a txn commits, its effects persist.
"I will survive…"
一旦事务提交,它的结果是永久性的,不会因为系统故障而丢失。
这意味着事务一旦提交,其改变将永久保存在数据库中。即使系统崩溃或重启,已提交的事务的结果也不会丢失。
ATOMICITY OF TRANSACTIONS¶
Two possible outcomes of executing a txn:
- Commit after completing all its actions.
- Abort (or be aborted by the DBMS) after executing some actions.
DBMS guarantees that txns are atomic.
- From user's point of view: txn always either executes all its actions or executes no actions at all.
Approach #1: Logging¶
- DBMS logs all actions so that it can undo the actions of aborted transactions.
- Maintain undo records both in memory and on disk.
- Think of this like the black box in airplanes...
Logging is used by almost every DBMS.
- Audit Trail
- Efficiency Reasons
Approach #2: Shadow Paging¶
- DBMS makes copies of pages and txns make changes to those copies. Only when the txn commits is the page made visible to others.
- Originally from IBM System R.
Few systems do this, because it wastes:
- CouchDB
- Tokyo Cabinet
- LMDB (OpenLDAP)
CONSISTENCY¶
The "world" represented by the database is logically correct. All questions asked about the data are given logically correct answers.
DATABASE CONSISTENCY¶
The database accurately models the real world and follows integrity constraints.
Transactions in the future see the effects of transactions committed in the past inside of the database.
TRANSACTION CONSISTENCY¶
If the database is consistent before the transaction starts (running alone), it will also be consistent after.
Transaction consistency is the application's responsibility. DBMS cannot control this.
- We won't discuss this issue further…
MECHANISMS FOR ENSURING ISOLATION¶
A concurrency control protocol is how the DBMS decides the proper interleaving(交错) of operations from multiple transactions.
Two categories of protocols:
- Pessimistic: Don't let problems arise in the first place.
- Optimistic: Assume conflicts are rare, deal with them after they happen.
There is no guarantee that \(T_1\) will execute before \(T_2\) or vice-versa, if both are submitted together. But the net effect (最终成效) must be equivalent to these two transactions running serially(连续地) in some order.
INTERLEAVING TRANSACTIONS¶
We interleave txns to maximize concurrency.
- Slow disk/network I/O.
- Multi-core CPUs.
When one txn(资源/任务) stalls(停滞) because of a resource (e.g., page fault), another txn can continue executing and make forward progress.
EXAMPLE (GOOD)¶
EXAMPLE (BAD)¶
How do we judge whether a schedule is correct
If the schedule is equivalent to some serial execution.
FORMAL PROPERTIES OF SCHEDULES¶
Serial Schedule(串行调度)
- A schedule that does not interleave the actions of different transactions.
Equivalent Schedules(等价调度)
- For any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule.
- Doesn't matter what the arithmetic operations are!
Note
等价调度意味着虽然两个调度中的操作可能以不同的顺序交错执行,但它们的最终结果是相同的。换句话说,不管事务的操作如何排列,只要执行结果一致,这两个调度就是等价的
Serializable Schedule(可串行化调度)
- A schedule that is equivalent to some serial execution of the transactions.
- If each transaction preserves consistency, every serializable schedule preserves consistency.
Serializability is a less intuitive notion of correctness compared to txn initiation time or commit order, but it provides the DBMS with more flexibility in scheduling operations. (More flexibility means better parallelism.)
CONFLICTING OPERATIONS¶
We need a formal notion of equivalence that can be implemented efficiently based on the notion of "conflicting" operations.
Two operations conflict if:
- They are by different transactions,
- They are on the same object and one of them is a write.
Interleaved Execution Anomalies
- Read-Write Conflicts (R-W)
- Write-Read Conflicts (W-R)
- Write-Write Conflicts (W-W)
READ-WRITE CONFLICTS¶
Unrepeatable Read: Txn gets different values when reading the same object multiple times.
WRITE-READ CONFLICTS¶
Dirty Read: One txn reads data written by another txn that has not committed yet.
WRITE-WRITE CONFLICTS¶
Lost Update: One txn overwrites uncommitted data from another uncommitted txn.
CONFLICT SERIALIZABILITY INTUITION¶
这里介绍一种很简单且直观的避免冲突的设计(事实上,我们用的并不多,因为它实在是太简单了...)
但是像下面这样的显然是不行的:
SERIALIZABILITY¶
Swapping operations is easy when there are only two txns in the schedule. It's cumbersome(繁琐的) when there are many txns.
Are there any faster algorithms to figure this out other than transposing operations?
DEPENDENCY GRAPHS¶
Dependency Graphs Details
What do we need to care about when designing a dependency graph?
- R and R - It's ok, don't need to care.
- W and R - Need to care.
- W and W - Need to care.
Examples¶
VIEW SERIALIZABILITY¶
Alternative (broader) notion of serializability.
Schedules \(S_1\) and \(S_2\) are view equivalent if:
- If \(T_1\) reads initial value of A in \(S_1\) , then \(T_1\) also reads initial value of A in \(S_2\).
- If \(T_1\) reads value of A written by \(T_2\) in \(S_1\) , then \(T_1\) also reads value of A written by \(T_2\) in \(S_2\).
- If \(T_1\) writes final value of A in \(S_1\) , then \(T_1\) also writes final value of A in \(S_2\).
视图可串行化 (View Serializability)
定义:
- 视图可串行化允许更多的调度,相较于冲突可串行化。这是因为视图可串行化的定义更宽松,只要调度的视图等价于某个串行调度,就认为是可串行化的。
解释:
- 视图可串行化考虑了事务的最终结果是否与某个串行调度的结果相同,而不是关心操作的具体顺序。它关注的是读取和写入数据的最终一致性。
- 然而,尽管视图可串行化允许更多的调度,但是在实际系统中很难有效地强制执行。因为判断一个调度是否视图可串行化需要复杂的计算,涉及到对所有可能的读写操作进行分析。
冲突可串行化 (Conflict Serializability)
定义:
- 冲突可串行化是一种更严格的可串行化标准,只允许那些调度中的操作没有冲突的调度。
解释:
- 冲突可串行化的定义依赖于事务间操作的冲突。两个操作冲突,如果它们访问相同的数据,并且至少有一个是写操作。
- 这种方法更容易实现,因为可以通过检测和避免冲突来确保事务的可串行化。
共同的限制
- 无论是视图可串行化还是冲突可串行化,都不能涵盖所有你可能认为是“可串行化”的调度。这是因为这两种定义都没有理解操作或数据的实际含义。
In practice, Conflict Serializability is what systems support because it can be enforced efficiently.
To allow more concurrency, some special cases get handled separately at the application level.
TRANSACTION DURABILITY¶
All the changes of committed transactions should be persistent.
- No torn(中断) updates.
- No changes from failed transactions.
The DBMS can use either logging or shadow paging to ensure that all changes are durable.