事务

62.7K
0
0
最后修改于

在数据系统中,很多事情都可能出错:

  • 数据库软硬件可能在任何时候发生故障(包括在写入操作过程中)。
  • 应用程序可能在任何时候崩溃(包括在一系列操作中途)。
  • 网络中断可能意外地切断应用程序与数据库的连接,或一个数据库节点与另一个节点之间的连接。
  • 多客户端可能同时写入数据库,导致并发冲突。
  • 客户端可能读取到中间态数据(只被部分更新)。
  • 客户端的竞态条件可能导致错误。
    In order to be reliable, a system has to deal with these faults and ensure that they don’t cause catastrophic failure of the entire system. However, implementing fault-tolerance mechanisms is a lot of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of testing to ensure that the solution actually works.
    为了可靠,系统必须处理这些故障,并确保它们不会导致整个系统崩溃。但容错机制的实现很复杂,需要仔细考虑所有可能出错的情况,并进行大量测试以确保解决方案确实有效。

For decades, transactions have been the mechanism of choice for simplifying these issues. A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback). If it fails, the application can safely retry. With transactions, error handling becomes much simpler for an application, because it doesn’t need to worry about partial failure—i.e., the case where some operations succeed and some fail (for whatever reason).
几十年来,事务一直是简化这些问题的首选机制。事务是一种让应用程序将多个读写操作组合成一个逻辑单元的方式。从概念上讲,事务中的所有读和写操作都视为一个原子操作执行:要么整个事务都成功(提交),要么都失败(中止,回滚)。如果失败了,应用可以安全重试。有了事务,错误处理对应用程序来说变得简单得多,因为它不需要担心部分失败 —— 即某些操作成功而另一些操作失败(无论什么原因)。

If you have spent years working with transactions, they may seem obvious, but we shouldn’t take them for granted. Transactions are not a law of nature; they were created with a purpose, namely to simplify the programming model for applications accessing a database. By using transactions, the application is free to ignore certain potential error scenarios and concurrency issues, because the database takes care of them instead (we call these safety guarantees).

如果你多年来一直使用事务,它们可能看起来很明显,但我们不应该对此视而不见。事务并非自然法则;它们是为了一个目的而创造的,即简化访问数据库应用的编程模型。通过使用事务,应用程序可以自由地忽略某些潜在的错误场景和并发问题,因为数据库会替它们处理这些问题(我们称之为安全保证)。

Not every application needs transactions, and sometimes there are advantages to weakening transactional guarantees or abandoning them entirely (for example, to achieve higher performance or higher availability). Some safety properties can be achieved without transactions. On the other hand, transactions can prevent a lot of grief: for example, the technical cause behind the Post Office Horizon scandal (see “How Important Is Reliability?”) was probably a lack of ACID transactions in the underlying accounting system [1].
并非所有应用都需要事务,有时削弱事务性保证或完全放弃事务是有优势的(例如,为了实现更高的性能或更高的可用性)。某些安全性属性可以在没有事务的情况下实现。另一方面,事务可以防止很多麻烦:例如,邮政总局 Horizon 丑闻背后的技术原因(参见 “可靠性有多重要?”)可能就是底层会计系统缺乏 ACID 事务。

How do you figure out whether you need transactions? In order to answer that question, we first need to understand exactly what safety guarantees transactions can provide, and what costs are associated with them. Although transactions seem straightforward at first glance, there are actually many subtle but important details that come into play.
如何判断是否需要事务?为了回答这个问题,我们首先需要准确理解事务可以提供哪些安全性保证,以及与它们相关的成本。虽然事务乍一看很简单,但实际上有很多微妙但重要的细节需要考虑。

In this chapter, we will examine many examples of things that can go wrong, and explore the algorithms that databases use to guard against those issues. We will go especially deep in the area of concurrency control, discussing various kinds of race conditions that can occur and how databases implement isolation levels such as read committed, snapshot isolation, and serializability.
在本章中,我们将探讨许多可能出错的情况,并研究数据库用来防范这些问题的算法。我们将特别深入并发控制领域,讨论可能出现的各种竞态条件,以及数据库如何实现诸如读已提交、快照隔离和可串行化等隔离级别。

Concurrency control is relevant for both single-node and distributed databases. Later in this chapter, in “Distributed Transactions”, we will examine the two-phase commit protocol and the challenge of achieving atomicity in a distributed transaction.
并发控制对单节点和分布式数据库都至关重要。本章稍后,在 “分布式事务” 部分,我们将探讨两阶段提交协议以及分布式事务中实现原子性的挑战。

What Exactly Is a Transaction? 事务究竟是什么?#

Almost all relational databases today, and some nonrelational databases, support transactions. Most of them follow the style that was introduced in 1975 by IBM System R, the first SQL database [2,3,4]. Although some implementation details have changed, the general idea has remained virtually the same for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily similar to that of System R.
当今几乎所有关系型数据库,以及一些非关系型数据库,都支持事务。它们大多遵循 1975 年由 IBM System R(首个 SQL 数据库)引入的风格。尽管一些实现细节有所变化,但 50 年来基本理念几乎保持不变:MySQL、PostgreSQL、Oracle、SQL Server 等数据库的事务支持与 System R 惊人地相似。

In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to improve upon the relational status quo by offering a choice of new data models (see Chapter 3), and by including replication (Chapter 6) and sharding (Chapter 7) by default. Transactions were the main casualty of this movement: many of this generation of databases abandoned transactions entirely, or redefined the word to describe a much weaker set of guarantees than had previously been understood.
在 21 世纪初,非关系型(NoSQL)数据库开始流行起来。它们旨在通过提供新的数据模型选择(见第 3 章)以及默认包含复制(第 6 章)和分片(第 7 章)来改进关系型数据库的现状。事务是这个运动的主要受害者:这一代许多数据库完全放弃了事务,或者重新定义了这个术语,以描述一个比之前理解的要弱得多的保证集。

The hype around NoSQL distributed databases led to a popular belief that transactions were fundamentally unscalable, and that any large-scale system would have to abandon transactions in order to maintain good performance and high availability. More recently, that belief has turned out to be wrong. So-called “NewSQL” databases such as CockroachDB [5], TiDB [6], Spanner [7], FoundationDB [8], and Yugabyte have shown that transactional systems can scale to large data volumes and high throughput. These systems combine sharding with consensus protocols (Chapter 10) to provide strong ACID guarantees at scale.
围绕 NoSQL 分布式数据库的炒作导致了一种普遍的信念,即事务从根本上不可扩展,任何大规模系统都必须放弃事务才能保持良好的性能和高可用性。最近,这种信念已被证明是错误的。所谓的 “新 SQL” 数据库,如 CockroachDB [5]、TiDB [6]、Spanner [7]、FoundationDB [8] 和 Yugabyte,已经证明事务性系统可以扩展到大量数据和高吞吐量。这些系统结合了分片和共识协议(第 10 章)来在大规模下提供强 ACID 保证。

However, that doesn’t mean that every system must be transactional either: like every other technical design choice, transactions have advantages and limitations. In order to understand those trade-offs, let’s go into the details of the guarantees that transactions can provide—both in normal operation and in various extreme (but realistic) circumstances.
然而,这并不意味着每个系统都必须是事务性的:和所有其他技术设计选择一样,事务有其优势和局限性。为了理解这些权衡,让我们深入探讨事务可以提供的保证 —— 无论是在正常操作中还是在各种极端(但现实)的情况下。

The Meaning of ACIDACID 的意义#

The safety guarantees provided by transactions are often described by the well-known acronym ACID, which stands for Atomicity, Consistency, Isolation, and Durability. It was coined in 1983 by Theo Härder and Andreas Reuter [9] in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
事务提供的安全保证通常用众所周知的缩写 ACID 来描述,它代表原子性、一致性、隔离性和持久性。这个术语由 Theo Härder 和 Andreas Reuter 于 1983 年提出 [9],旨在为数据库中的容错机制建立精确的术语。

However, in practice, one database’s implementation of ACID does not equal another’s implementation. For example, as we shall see, there is a lot of ambiguity around the meaning of isolation [10]. The high-level idea is sound, but the devil is in the details. Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term.
然而,在实践中,一个数据库对 ACID 的实现并不等同于另一个数据库的实现。例如,正如我们将看到的,关于隔离性的含义存在很多模糊之处。高级概念是合理的,但关键在于细节。如今,当一个系统声称 “符合 ACID 标准” 时,你实际上可以期待什么样的保证尚不明确。不幸的是,ACID 已经基本上变成了一个营销术语。

(Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency [11]. This is even more vague than the definition of ACID. It seems that the only sensible definition of BASE is “not ACID”; i.e., it can mean almost anything you want.)
(不满足 ACID 标准的系统有时被称为 BASE,BASE 代表基本可用、软状态和最终一致性 [11]。这个定义比 ACID 的定义更加模糊。看来 BASE 唯一合理的定义就是 “不是 ACID”;也就是说,它可以代表你想要表达几乎任何意思。)

Let’s dig into the definitions of atomicity, consistency, isolation, and durability, as this will let us refine our idea of transactions.
让我们深入探讨原子性、一致性、隔离性和持久性的定义,这将让我们完善对事务的理解。

Atomicity 原子性#

In general, atomic refers to something that cannot be broken down into smaller parts. The word means similar but subtly different things in different branches of computing. For example, in multi-threaded programming, if one thread executes an atomic operation, that means there is no way that another thread could see the half-finished result of the operation. The system can only be in the state it was before the operation or after the operation, not something in between.
通常来说,原子指的是不可再分解成更小部分的事物。在不同的计算领域中,这个词的意思相似但略有不同。例如,在多线程编程中,如果一个线程执行了原子操作,这意味着另一个线程无法看到该操作未完成的结果。系统只能处于操作前的状态或操作后的状态,而不能处于两者之间的状态。

By contrast, in the context of ACID, atomicity is not about concurrency. It does not describe what happens if several processes try to access the same data at the same time, because that is covered under the letter I, for isolation (see “Isolation”).
相比之下,在 ACID 的上下文中,原子性并不涉及并发。它不描述多个进程同时尝试访问相同数据时会发生什么,因为这一点在隔离(I,见 “隔离”)中已有涵盖。

Rather, ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed—for example, a process crashes, a network connection is interrupted, a disk becomes full, or some integrity constraint is violated.If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction.
相反,ACID 的原子性描述了当客户端想要进行多次写入,但在部分写入处理完成后发生故障时会发生什么 —— 例如,进程崩溃、网络连接中断、磁盘满或违反某些完整性约束。如果将写入分组到一个原子事务中,并且由于故障事务无法完成(提交),那么事务将被中止,数据库必须丢弃或撤销在该事务中已进行的任何写入。

Without atomicity, if an error occurs partway through making multiple changes, it’s difficult to know which changes have taken effect and which haven’t. The application could try again, but that risks making the same change twice, leading to duplicate or incorrect data. Atomicity simplifies this problem: if a transaction was aborted, the application can be sure that it didn’t change anything, so it can safely be retried.
如果没有原子性,如果在进行多次更改的过程中发生错误,很难知道哪些更改已经生效,哪些尚未生效。应用程序可以尝试再次操作,但这有两次进行相同更改的风险,导致数据重复或不正确。原子性简化了这个问题:如果事务被中止,应用程序可以确信它没有进行任何更改,因此可以安全地重试。

The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity. Perhaps abortability would have been a better term than atomicity, but we will stick with atomicity since that’s the usual word.
能够在出错时中止事务并丢弃该事务的所有写入操作是 ACID 原子性的定义性特征。或许用 “可中止性” 比 “原子性” 更合适,但我们还是坚持使用 “原子性”,因为这是通常的说法。

Consistency 一致性#

The word consistency is terribly overloaded:
“一致性” 这个词用得太泛滥了:

  • In Chapter 6 we discussed replica consistency and the issue of eventual consistency that arises in asynchronously replicated systems (see “Problems with Replication Lag”).
    在第六章中,我们讨论了副本一致性以及异步复制系统(参见 “复制延迟的问题”)中出现的最终一致性问题。
  • A consistent snapshot of a database, e.g. for a backup, is a snapshot of the entire database as it existed at one moment in time. More precisely, it is consistent with the happens-before relation (see “The “happens-before” relation and concurrency”): that is, if the snapshot contains a value that was written at a particular time, then it also reflects all the writes that happened before that value was written.
    数据库的一致性快照,例如用于备份,是数据库在某一时刻存在状态的全局快照。更精确地说,它与 “发生之前” 关系(参见 “发生之前关系与并发”)保持一致:也就是说,如果快照中包含某个特定时间写入的值,那么它也会反映在该值写入之前所发生的所有写入。
  • Consistent hashing is an approach to sharding that some systems use for rebalancing (see “Consistent hashing”).
    一致性哈希是一种用于分片并在某些系统中进行再平衡的方法(参见 “一致性哈希”)。
  • In the CAP theorem (see Chapter 10), the word consistency is used to mean linearizability (see “Linearizability”).
    在 CAP 定理(参见第 10 章)中,一致性一词指的是线性化(参见 “线性化”)。
  • In the context of ACID, consistency refers to an application-specific notion of the database being in a “good state.”
    在 ACID 的上下文中,一致性是指数据库处于 “良好状态” 的应用特定概念。

It’s unfortunate that the same word is used with at least five different meanings.
同一个词至少有五种不同含义,这很不幸。

The idea of ACID consistency is that you have certain statements about your data (invariants) that must always be true—for example, in an accounting system, credits and debits across all accounts must always be balanced. If a transaction starts with a database that is valid according to these invariants, and any writes during the transaction preserve the validity, then you can be sure that the invariants are always satisfied. (An invariant may be temporarily violated during transaction execution, but it should be satisfied again at transaction commit.)

ACID 一致性理念在于,你有一系列关于数据的声明(不变式)必须始终为真 —— 例如,在一个会计系统中,所有账户的贷方和借方必须始终平衡。如果事务开始时数据库符合这些不变式,并且事务中的任何写入都保持其有效性,那么你可以确信这些不变式始终得到满足。(不变式在事务执行期间可能会暂时被违反,但在事务提交时应再次满足。)

If you want the database to enforce your invariants, you need to declare them as constraints as part of the schema. For example, foreign key constraints, uniqueness constraints, or check constraints (which restrict the values that can appear in an individual row) are often used to model specific types of invariants. More complex consistency requirements can sometimes be modeled using triggers or materialized views [12].
如果你希望数据库强制执行你的不变式,需要将它们作为约束的一部分声明在模式中。例如,外键约束、唯一性约束或检查约束(限制单个行中可以出现的值)通常用于建模特定类型的不变式。更复杂的一致性要求有时可以使用触发器或物化视图进行建模。

However, complex invariants can be difficult or impossible to model using the constraints that databases usually provide. In that case, it’s the application’s responsibility to define its transactions correctly so that they preserve consistency. If you write bad data that violates your invariants, but you haven’t declared those invariants, the database can’t stop you. As such, the C in ACID often depends on how the application uses the database, and it’s not a property of the database alone.
然而,复杂的约束可能难以或无法使用数据库通常提供的约束来建模。在这种情况下,应用有责任正确定义其事务,以确保它们保持一致性。如果你写入违反约束的坏数据,但你没有声明这些约束,数据库就无法阻止你。因此,ACID 中的 C(一致性)往往取决于应用程序如何使用数据库,而不仅仅是数据库本身的属性。

Isolation 隔离#

Most databases are accessed by several clients at the same time. That is no problem if they are reading and writing different parts of the database, but if they are accessing the same database records, you can run into concurrency problems (race conditions).

大多数数据库同时被多个客户端访问。它们访问相同的数据库记录,可能会遇到并发问题 / 竞态条件

Figure 8-1 is a simple example of this kind of problem. Say you have two clients simultaneously incrementing a counter that is stored in a database. Each client needs to read the current value, add 1, and write the new value back (assuming there is no increment operation built into the database). In Figure 8-1 the counter should have increased from 42 to 44, because two increments happened, but it actually only went to 43 because of the race condition.

图 8-1 是这类问题的一个简单示例。假设你有两个客户端同时递增一个存储在数据库中的计数器。每个客户端需要读取当前值,加 1,然后将新值写回(假设数据库没有内置递增操作)。在图 8-1 中,计数器应该从 42 增加到 44,因为发生了两次递增,但实际上由于竞态条件只增加到 43。

ddia 0801

Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes. The classic database textbooks formalize isolation as serializability, which means that each transaction can pretend that it is the only transaction running on the entire database. The database ensures that when the transactions have committed, the result is the same as if they had run serially (one after another), even though in reality they may have run concurrently [13].
ACID 意义上的隔离是指并发执行的事务彼此隔离:它们不会互相干扰。经典数据库教科书将隔离形式化为可串行化,这意味着每个事务可以假装自己是整个数据库上唯一运行的事务。数据库确保当事务提交时,结果与它们串行运行时相同,即使实际上它们可能是并发运行的。

However, serializability has a performance cost. In practice, many databases use forms of isolation that are weaker than serializability: that is, they allow concurrent transactions to interfere with each other in limited ways. Some popular databases, such as Oracle, don’t even implement it (Oracle has an isolation level called “serializable,” but it actually implements snapshot isolation, which is a weaker guarantee than serializability [10,14]). This means that some kinds of race conditions can still occur. We will explore snapshot isolation and other forms of isolation in “Weak Isolation Levels”.
然而,可串行化会带来性能成本。在实践中,许多数据库使用比可串行化更弱的隔离级别:也就是说,它们允许并发事务以有限的方式相互干扰。一些流行的数据库,如 Oracle,甚至没有实现它(Oracle 有一个名为 “可串行化” 的隔离级别,但实际上它实现的是快照隔离,这比可串行化提供的保证要弱)。这意味着某些类型的竞态条件仍然可能发生。我们将在 “弱隔离级别” 中探讨快照隔离和其他形式的隔离。

Durability 持久性#

The purpose of a database system is to provide a safe place where data can be stored without fear of losing it. Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.
数据库系统的目的是提供一个安全的地方,可以存储数据而不必担心丢失。持久性是指一旦事务成功提交,它所写入的任何数据都不会被遗忘,即使发生硬件故障或数据库崩溃也是如此。

In a single-node database, durability typically means that the data has been written to nonvolatile storage such as a hard drive or SSD. Regular file writes are usually buffered in memory before being sent to the disk sometime later, which means they would be lost if there is a sudden power failure; many databases therefore use the fsync() system call to ensure the data really has been written to disk. Databases usually also have a write-ahead log or similar (see “Making B-trees reliable”), which allows them to recover in the event that a crash occurs part way through a write.
在单节点数据库中,持久性通常意味着数据已被写入非易失性存储,例如硬盘或 SSD。常规文件写入通常会在稍后发送到磁盘之前在内存中进行缓冲,这意味着如果发生突然断电,这些写入将会丢失;因此,许多数据库使用 fsync() 系统调用来确保数据确实已被写入磁盘。数据库通常还具有预写日志或类似机制(参见 “使 B 树可靠”),这允许它们在写入过程中发生崩溃时恢复。

In a replicated database, durability may mean that the data has been successfully copied to some number of nodes. In order to provide a durability guarantee, a database must wait until these writes or replications are complete before reporting a transaction as successfully committed. However, as discussed in “Reliability and Fault Tolerance”, perfect durability does not exist: if all your hard disks and all your backups are destroyed at the same time, there’s obviously nothing your database can do to save you.
在一个多副本的数据库中,持久性可能意味着数据已经被成功复制到一些节点上。为了提供持久性保证,数据库必须等待这些写入或复制操作完成后再报告事务成功提交。然而,正如在 “可靠性和容错性” 中讨论的,完美的持久性并不存在:如果你的所有硬盘和所有备份在同一时间被摧毁,显然你的数据库无法挽救你。

Single-Object and Multi-Object Operations 单对象和多对象操作#

To recap, in ACID, atomicity and isolation describe what the database should do if a client makes several writes within the same transaction:
总结来说,在 ACID 中,原子性和隔离性描述了当客户端在同一个事务中执行多个写入操作时,数据库应该如何处理:

Atomicity 原子性

If an error occurs halfway through a sequence of writes, the transaction should be aborted, and the writes made up to that point should be discarded. In other words, the database saves you from having to worry about partial failure, by giving an all-or-nothing guarantee.
如果在写入序列中途发生错误,事务应该被中止,并且到那个时间点为止的写入操作应该被丢弃。换句话说,数据库通过提供全有或全无的保证,让你不必担心部分失败。

Isolation 隔离性

Concurrently running transactions shouldn’t interfere with each other. For example, if one transaction makes several writes, then another transaction should see either all or none of those writes, but not some subset.
并发运行的事务不应相互干扰。例如,如果一个事务执行多个写入操作,那么另一个事务应该看到全部或全部不看到这些写入操作,但不会看到部分写入操作。

These definitions assume that you want to modify several objects (rows, documents, records) at once. Such multi-object transactions are often needed if several pieces of data need to be kept in sync.Figure 8-2 shows an example from an email application. To display the number of unread messages for a user, you could query something like:
这些定义假设你希望同时修改多个对象(行、文档、记录)。如果需要将多个数据项保持同步,则经常需要此类多对象交易。图 8-2 展示了一个电子邮件应用程序的示例。为了显示某个用户未读消息的数量,你可以查询类似的内容:

ddia 0802

However, you might find this query to be too slow if there are many emails, and decide to store the number of unread messages in a separate field (a kind of denormalization, which we discuss in “Normalization, Denormalization, and Joins”). Now, whenever a new message comes in, you have to increment the unread counter as well, and whenever a message is marked as read, you also have to decrement the unread counter.
然而,如果你发现这个查询在邮件很多时太慢,可能会决定将未读消息的数量存储在一个单独的字段中(这是一种反规范化,我们将在 “规范化、反规范化和连接” 中讨论)。现在,每当有新消息进来时,你必须增加未读计数器,每当消息被标记为已读时,你也必须减少未读计数器。

In Figure 8-2, user 2 experiences an anomaly: the mailbox listing shows an unread message, but the counter shows zero unread messages because the counter increment has not yet happened. (If an incorrect counter in an email application seems too insignificant, think of a customer account balance instead of an unread counter, and a payment transaction instead of an email.) Isolation would have prevented this issue by ensuring that user 2 sees either both the inserted email and the updated counter, or neither, but not an inconsistent halfway point.
在图 8-2 中,用户 2 经历了一个异常:邮箱列表显示有一条未读消息,但计数器显示零条未读消息,因为计数器增加还没有发生。(如果电子邮件应用中的错误计数器看起来微不足道,不妨想想客户账户余额而不是未读计数器,以及支付交易而不是电子邮件。)隔离性通过确保用户 2 要么看到插入的邮件和更新的计数器,要么什么也看不到,但不会看到不一致的中间状态,从而防止了这个问题。

Figure 8-3 illustrates the need for atomicity: if an error occurs somewhere over the course of the transaction, the contents of the mailbox and the unread counter might become out of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted and the inserted email is rolled back.
图 8-3 说明了原子性的必要性:如果在事务过程中发生错误,邮箱内容和未读计数器可能会变得不同步。在一个原子事务中,如果计数器的更新失败,事务将被中止,插入的邮件将被回滚。

ddia 0803

Multi-object transactions require some way of determining which read and write operations belong to the same transaction. In relational databases, that is typically done based on the client’s TCP connection to the database server: on any particular connection, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction. If the TCP connection is interrupted, the transaction must be aborted.
多对象事务需要某种方式来确定哪些读和写操作属于同一事务。在关系型数据库中,这通常是基于客户端到数据库服务器的 TCP 连接来完成的:在任何特定连接中,一个 BEGIN TRANSACTION 和一个 COMMIT 语句之间的所有内容都被认为是同一事务的一部分。如果 TCP 连接中断,事务必须被中止。

On the other hand, many nonrelational databases don’t have such a way of grouping operations together. Even if there is a multi-object API (for example, a key-value store may have a multi-put operation that updates several keys in one operation), that doesn’t necessarily mean it has transaction semantics: the command may succeed for some keys and fail for others, leaving the database in a partially updated state.
另一方面,许多非关系型数据库没有将操作组合在一起的方法。即使存在多对象 API(例如,键值存储可能有多个 put 操作,一次操作中更新多个键),这并不意味着它具有事务语义:命令可能对某些键成功,而对其他键失败,导致数据库处于部分更新状态。

Single-object writes 单对象写入#

Atomicity and isolation also apply when a single object is being changed. For example, imagine you are writing a 20 KB JSON document to a database:
原子性和隔离性也适用于单个对象被修改的情况。例如,假设你正在将一个 20 KB 的 JSON 文档写入数据库:

  • If the network connection is interrupted after the first 10 KB have been sent, does the database store that unparseable 10 KB fragment of JSON?
    如果在发送了前 10 KB 后网络连接中断,数据库是否会存储那不可解析的 10 KB JSON 片段?
  • If the power fails while the database is in the middle of overwriting the previous value on disk, do you end up with the old and new values spliced together?
    如果在数据库正在将旧值覆盖到磁盘的过程中断电,最终是否会得到新旧值拼接在一起的结果?
  • If another client reads that document while the write is in progress, will it see a partially updated value?
    如果另一个客户端在写入过程中读取该文档,它会看到部分更新的值吗?

Those issues would be incredibly confusing, so storage engines almost universally aim to provide atomicity and isolation on the level of a single object (such as a key-value pair) on one node. Atomicity can be implemented using a log for crash recovery (see “Making B-trees reliable”), and isolation can be implemented using a lock on each object (allowing only one thread to access an object at any one time).
这些问题会极其令人困惑,因此存储引擎几乎普遍致力于在单个节点上的单个对象(如键值对)级别提供原子性和隔离性。原子性可以通过使用用于崩溃恢复的日志来实现(参见 “使 B 树可靠”),而隔离性可以通过在每个对象上使用锁来实现(允许同一时间只有一个线程访问对象)。

Some databases also provide more complex atomic operations, such as an increment operation, which removes the need for a read-modify-write cycle like that in Figure 8-1. Similarly popular is a conditional write operation, which allows a write to happen only if the value has not been concurrently changed by someone else (see “Conditional writes (compare-and-set)”), similarly to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency.
一些数据库还提供更复杂的原子操作,例如增量操作,它无需像图 8-1 中那样的读 - 修改 - 写循环。同样流行的是条件写入操作,它允许只有在值未被其他人并发更改时才发生写入(参见 “条件写入(比较并设置)”,类似于共享内存并发中的比较并设置或比较并交换(CAS)操作)。

Note 注意#

Strictly speaking, the term atomic increment uses the word atomic in the sense of multi-threaded programming. In the context of ACID, it should actually be called an isolated or serializable increment, but that’s not the usual term.
严格来说,术语原子增量使用的是多线程编程中原子一词的含义。在 ACID 的上下文中,它实际上应该被称为隔离或可序列化增量,但这并非常用术语。

These single-object operations are useful, as they can prevent lost updates when several clients try to write to the same object concurrently (see “Preventing Lost Updates”). However, they are not transactions in the usual sense of the word. For example, the “lightweight transactions” feature of Cassandra and ScyllaDB, and Aerospike’s “strong consistency” mode offer linearizable (see “Linearizability”) reads and conditional writes on a single object, but no guarantees across multiple objects.
这些单对象操作很有用,因为它们可以防止多个客户端并发写入相同对象时出现更新丢失(参见 “防止更新丢失”)。然而,它们并非通常意义上的事务。例如,Cassandra 和 ScyllaDB 的 “轻量级事务” 功能,以及 Aerospike 的 “强一致性” 模式,在单个对象上提供线性化(参见 “线性化”)的读取和条件写入,但无法跨多个对象提供保证。

The need for multi-object transactions 多对象事务的需求#

Do we need multi-object transactions at all? Would it be possible to implement any application with only a key-value data model and single-object operations?
我们是否真的需要多对象事务?是否有可能仅使用键值数据模型和单个对象操作来实现任何应用程序?

There are some use cases in which single-object inserts, updates, and deletes are sufficient. However, in many other cases writes to several different objects need to be coordinated:
有些用例中,单个对象的插入、更新和删除就足够了。然而,在许多其他情况下,需要协调对多个不同对象的写入:

  • In a relational data model, a row in one table often has a foreign key reference to a row in another table. Similarly, in a graph-like data model, a vertex has edges to other vertices. Multi-object transactions allow you to ensure that these references remain valid: when inserting several records that refer to one another, the foreign keys have to be correct and up to date, or the data becomes nonsensical.
    在一个关系型数据模型中,一个表中的一行通常会引用另一个表中的行作为外键。类似地,在图状数据模型中,一个顶点会通过边连接到其他顶点。多对象事务允许你确保这些引用保持有效:当插入多个相互引用的记录时,外键必须正确且是最新的,否则数据会变得没有意义。
  • In a document data model, the fields that need to be updated together are often within the same document, which is treated as a single object—no multi-object transactions are needed when updating a single document. However, document databases lacking join functionality also encourage denormalization (see “When to Use Which Model”). When denormalized information needs to be updated, like in the example of Figure 8-2, you need to update several documents in one go. Transactions are very useful in this situation to prevent denormalized data from going out of sync.
    在文档数据模型中,需要一起更新的字段通常位于同一文档内,该文档被视为一个单一对象 —— 更新单个文档时无需多对象事务。然而,缺乏连接功能的文档数据库也会鼓励反规范化(参见 “何时使用何种模型”)。当反规范化信息需要更新时,例如图 8-2 中的示例,你需要一次性更新多个文档。在这种情况下,事务非常有用,可以防止反规范化数据不同步。
  • In databases with secondary indexes (almost everything except pure key-value stores), the indexes also need to be updated every time you change a value. These indexes are different database objects from a transaction point of view: for example, without transaction isolation, it’s possible for a record to appear in one index but not another, because the update to the second index hasn’t happened yet (see “Sharding and Secondary Indexes”).
    在具有二级索引的数据库中(几乎所有非纯键值存储的数据库),每次更改值时,索引也需要更新。从事务的角度来看,这些索引是不同的数据库对象:例如,如果没有事务隔离,一条记录可能会出现在一个索引中而不会出现在另一个索引中,因为对第二个索引的更新尚未发生(参见 “分片与二级索引”)。

Such applications can still be implemented without transactions. However, error handling becomes much more complicated without atomicity, and the lack of isolation can cause concurrency problems. We will discuss those in “Weak Isolation Levels”, and explore alternative approaches in [Link to Come].
尽管这些应用程序仍然可以在没有事务的情况下实现。但是,没有原子性,错误处理会变得非常复杂,缺乏隔离性可能会导致并发问题。我们将在 “弱隔离级别” 中讨论这些问题,并在 [待定链接] 中探索替代方法。

Handling errors and aborts 处理错误和中止#

A key feature of a transaction is that it can be aborted and safely retried if an error occurred. ACID databases are based on this philosophy: if the database is in danger of violating its guarantee of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow it to remain half-finished.
事务的一个关键特性是,如果发生错误,它可以被中止并安全地重试。ACID 数据库基于这一理念:如果数据库有违反其原子性、隔离性或持久性保证的危险,它宁愿完全放弃事务,也不允许事务保持半完成状态。

Not all systems follow that philosophy, though. In particular, datastores with leaderless replication (see “Leaderless Replication”) work much more on a “best effort” basis, which could be summarized as “the database will do as much as it can, and if it runs into an error, it won’t undo something it has already done”—so it’s the application’s responsibility to recover from errors.
并非所有系统都遵循这一理念。特别是,具有无主复制的数据库(参见 “无主复制”)更多地基于 “尽力而为” 的原则,这可以总结为 “数据库会尽力完成,如果遇到错误,它不会撤销已经完成的部分”—— 因此,恢复错误是应用程序的责任。

Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling. For example, popular object-relational mapping (ORM) frameworks such as Rails’s ActiveRecord and Django don’t retry aborted transactions—the error usually results in an exception bubbling up the stack, so any user input is thrown away and the user gets an error message. This is a shame, because the whole point of aborts is to enable safe retries.
错误不可避免会发生,但许多软件开发者更倾向于只考虑顺利路径,而不是处理错误的复杂性。例如,Rails 的 ActiveRecord 和 Django 等流行的对象关系映射(ORM)框架不会重试中止的事务 —— 错误通常会导致异常冒泡,因此任何用户输入都会被丢弃,用户会收到错误消息。这很遗憾,因为中止的整个目的就是为了实现安全的重试。

Although retrying an aborted transaction is a simple and effective error handling mechanism, it isn’t perfect:
尽管重试中止的事务是一种简单有效的错误处理机制,但它并不完美:

  • If the transaction actually succeeded, but the network was interrupted while the server tried to acknowledge the successful commit to the client (so it timed out from the client’s point of view), then retrying the transaction causes it to be performed twice—unless you have an additional application-level deduplication mechanism in place.
    如果事务实际上已经成功,但在服务器尝试向客户端确认成功提交时网络中断(从客户端的角度来看超时了),那么重试事务会导致其被执行两次 —— 除非你有一个额外的应用级去重机制。
  • If the error is due to overload or high contention between concurrent transactions, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently from other errors (see “When an overloaded system won’t recover”).
    如果错误是由于过载或并发事务之间的高竞争引起的,重试事务会使问题变得更糟而不是更好。为了避免这种反馈循环,你可以限制重试次数,使用指数退避,并将与过载相关的错误与其他错误区分处理(见 “当过载系统无法恢复时”)
  • It is only worth retrying after transient errors (for example due to deadlock, isolation violation, temporary network interruptions, and failover); after a permanent error (e.g., constraint violation) a retry would be pointless.
    只有在出现瞬态错误时(例如由于死锁、隔离违规、临时网络中断和故障转移)才值得重试;对于永久性错误(例如约束违规),重试将毫无意义。
  • If the transaction also has side effects outside of the database, those side effects may happen even if the transaction is aborted. For example, if you’re sending an email, you wouldn’t want to send the email again every time you retry the transaction. If you want to make sure that several different systems either commit or abort together, two-phase commit can help (we will discuss this in “Two-Phase Commit (2PC)”).
    如果事务在数据库之外还有副作用,这些副作用即使事务中止也可能发生。例如,如果你正在发送电子邮件,你不会希望在每次重试事务时再次发送电子邮件。如果你希望确保多个不同的系统一起提交或中止,可以使用两阶段提交(我们将在 “两阶段提交(2PC)” 中讨论这一点)。
  • If the client process crashes while retrying, any data it was trying to write to the database is lost.
    如果客户端进程在重试时崩溃,它试图写入数据库的任何数据都将丢失。

Weak Isolation Levels 弱隔离级别#

If two transactions don’t access the same data, or if both are read-only, they can safely be run in parallel, because neither depends on the other. Concurrency issues (race conditions) only come into play when one transaction reads data that is concurrently modified by another transaction, or when the two transactions try to modify the same data.
如果两个事务不访问相同的数据,或者两者都是只读的,它们可以安全地并行运行,因为它们彼此不依赖。并发问题(竞态条件)只有在以下情况才会出现:一个事务读取由另一个事务并发修改的数据,或者两个事务尝试修改相同的数据。

Concurrency bugs are hard to find by testing, because such bugs are only triggered when you get unlucky with the timing. Such timing issues might occur very rarely, and are usually difficult to reproduce. Concurrency is also very difficult to reason about, especially in a large application where you don’t necessarily know which other pieces of code are accessing the database. Application development is difficult enough if you just have one user at a time; having many concurrent users makes it much harder still, because any piece of data could unexpectedly change at any time.
并发错误很难通过测试发现,因为这类错误只有在运气不好时才会触发。这类时间问题可能非常罕见,而且通常很难重现。并发也极难推理,特别是在大型应用程序中,你未必知道哪些其他代码正在访问数据库。如果只有一个用户,应用开发就已经足够困难;而多个并发用户会使问题更加复杂,因为任何数据都可能随时发生意外变化。

For that reason, databases have long tried to hide concurrency issues from application developers by providing transaction isolation. In theory, isolation should make your life easier by letting you pretend that no concurrency is happening: serializable isolation means that the database guarantees that transactions have the same effect as if they ran serially (i.e., one at a time, without any concurrency).
因此,数据库长期以来试图通过提供事务隔离来隐藏并发问题,让应用开发者不必担心。理论上,隔离应该会让你的工作更简单,因为它允许你假装没有并发发生:可序列化隔离意味着数据库保证事务的效果等同于它们串行运行(即一次运行一个,没有任何并发)。

In practice, isolation is unfortunately not that simple. Serializable isolation has a performance cost, and many databases don’t want to pay that price [10]. It’s therefore common for systems to use weaker levels of isolation, which protect against some concurrency issues, but not all. Those levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are nevertheless used in practice [30].
在实践中,隔离性并不那么简单。可序列化隔离有性能成本,许多数据库都不想支付这个代价 [10]。因此,系统通常使用较弱级别的隔离,这些级别可以防止一些并发问题,但并非所有问题。这些隔离级别理解起来要困难得多,并且可能导致微妙错误,但它们在实践中仍然被使用 [30]。

Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have caused substantial loss of money [31,32,33], led to investigation by financial auditors [34], and caused customer data to be corrupted [35]. A popular comment on revelations of such problems is “Use an ACID database if you’re handling financial data!”—but that misses the point. Even many popular relational database systems (which are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these bugs from occurring.
由弱事务隔离引起的并发问题并非仅仅是理论问题。它们已经导致了重大的经济损失 [31, 32, 33],引发了金融审计人员的调查 [34],并导致客户数据损坏 [35]。对于此类问题的揭露,常见的评论是 “如果你处理的是金融数据,那就使用 ACID 数据库!”—— 但这并未抓住问题的核心。即使是许多流行的关系型数据库系统(通常被认为是 “ACID” 的),也使用弱隔离,因此它们未必能防止这些问题的发生。

Note 注意#

Incidentally, much of the banking system relies on text files that are exchanged via secure FTP [36]. In this context, having an audit trail and some human-level fraud prevention measures is actually more important than ACID properties.
顺便说一句,银行系统很大程度上依赖通过安全 FTP 交换的文本文件 [36]。在这种情况下,拥有审计追踪和一些人类级别的欺诈预防措施实际上比 ACID 属性更为重要。

Those examples also highlight an important point: even if concurrency issues are rare in normal operation, you have to consider the possibility that an attacker deliberately sends a burst of highly concurrent requests to your API in an attempt to deliberately exploit concurrency bugs [31]. Therefore, in order to build applications that are reliable and secure, you have to ensure that such bugs are systematically prevented.
这些例子也突显了一个重要观点:即使正常操作中并发问题很少出现,你也必须考虑攻击者可能故意发送大量高度并发请求到你的 API,试图故意利用并发漏洞 [31]。因此,为了构建可靠且安全的应用程序,你必须确保这类漏洞能被系统性地预防。

In this section we will look at several weak (nonserializable) isolation levels that are used in practice, and discuss in detail what kinds of race conditions can and cannot occur, so that you can decide what level is appropriate to your application. Once we’ve done that, we will discuss serializability in detail (see “Serializability”). Our discussion of isolation levels will be informal, using examples. If you want rigorous definitions and analyses of their properties, you can find them in the academic literature [37,38,39,40].
在本节中,我们将探讨实际应用中使用的几种弱(非可串行化)隔离级别,并详细讨论可能和不可能发生的竞态条件,以便你可以决定哪种级别适合你的应用。完成这些之后,我们将详细讨论可串行化(参见 “可串行化”)。我们对隔离级别的讨论将是非正式的,使用示例来说明。如果你需要严格的定义和对其特性的分析,可以在学术文献中找到 [37, 38, 39, 40]。

Read Committed 已提交#

The most basic level of transaction isolation is read committed. It makes two guarantees:
最基础的交易隔离级别是读已提交。它提供了两个保证:

  1. When reading from the database, you will only see data that has been committed (no dirty reads).
    当从数据库读取时,你只会看到已提交的数据(不会出现脏读)。
  2. When writing to the database, you will only overwrite data that has been committed (no dirty writes).
    当向数据库写入时,你只会覆盖已提交的数据(不会出现脏写)。

Some databases support an even weaker isolation level called read uncommitted. It prevents dirty writes, but does not prevent dirty reads. Let’s discuss these two guarantees in more detail.
一些数据库支持一种更弱的隔离级别,称为读未提交。它防止脏写,但不会防止脏读。让我们更详细地讨论这两个保证。

No dirty reads 没有脏读#

Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted. Can another transaction see that uncommitted data? If yes, that is called a dirty read [3].
想象一个交易已将一些数据写入数据库,但该交易尚未提交或中止。另一个交易能看到这些未提交的数据吗?如果可以,这就称为脏读 [3]。

Transactions running at the read committed isolation level must prevent dirty reads. This means that any writes by a transaction only become visible to others when that transaction commits (and then all of its writes become visible at once). This is illustrated in Figure 8-4, where user 1 has set x = 3, but user 2’s get x still returns the old value, 2, while user 1 has not yet committed.
运行在读已提交隔离级别的交易必须防止脏读。这意味着一个交易的任何写入只有在该交易提交时(然后所有写入内容会一次性全部可见)才对其他事务可见。图 8-4 说明了这一点,其中用户 1 将 x 设置为 3,但用户 2 的 get x 仍然返回旧值 2,而用户 1 尚未提交。

ddia 0804

There are a few reasons why it’s useful to prevent dirty reads:
防止脏读有几个原因:

  • If a transaction needs to update several rows, a dirty read means that another transaction may see some of the updates but not others. For example, in Figure 8-2, the user sees the new unread email but not the updated counter. This is a dirty read of the email. Seeing the database in a partially updated state is confusing to users and may cause other transactions to take incorrect decisions.
    如果一个事务需要更新多行,脏读意味着另一个事务可能看到部分更新但看不到其他更新。例如,在图 8-2 中,用户看到了未读的新邮件但未看到更新的计数器。这是对邮件的脏读。看到数据库处于部分更新状态会使用户感到困惑,并可能导致其他事务做出错误的决定。
  • If a transaction aborts, any writes it has made need to be rolled back (like in Figure 8-3). If the database allows dirty reads, that means a transaction may see data that is later rolled back—i.e., which is never actually committed to the database. Any transaction that read uncommitted data would also need to be aborted, leading to a problem called cascading aborts.
    如果事务中止,它所做的任何写入都需要回滚(如图 8-3 所示)。如果数据库允许脏读,这意味着一个事务可能会看到后来被回滚的数据 —— 即实际上从未提交到数据库的数据。任何读取未提交数据的交易也需要中止,从而导致级联中止的问题。

No dirty writes 无脏写#

What happens if two transactions concurrently try to update the same row in a database? We don’t know in which order the writes will happen, but we normally assume that the later write overwrites the earlier write.
如果两个事务同时尝试更新数据库中的同一行会发生什么?我们不知道写入的顺序,但通常假设后写入的内容会覆盖先前的写入。

However, what happens if the earlier write is part of a transaction that has not yet committed, so the later write overwrites an uncommitted value? This is called a dirty write [37]. Transactions running at the read committed isolation level must prevent dirty writes, usually by delaying the second write until the first write’s transaction has committed or aborted.
然而,如果先前的写入是尚未提交的事务的一部分,导致后来的写入覆盖了未提交的值会怎样?这称为脏写 [37]。以读已提交隔离级别运行的事务必须防止脏写,通常通过将第二次写入延迟到第一次写入的事务提交或中止之后。

By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:
通过防止脏写,此隔离级别避免了某些并发问题:

  • If transactions update multiple rows, dirty writes can lead to a bad outcome. For example, consider Figure 8-5, which illustrates a used car sales website on which two people, Aaliyah and Bryce, are simultaneously trying to buy the same car. Buying a car requires two database writes: the listing on the website needs to be updated to reflect the buyer, and the sales invoice needs to be sent to the buyer. In the case of Figure 8-5, the sale is awarded to Bryce (because he performs the winning update to the listings table), but the invoice is sent to Aaliyah (because she performs the winning update to the invoices table). Read committed prevents such mishaps.
    如果事务更新多行,脏写可能导致不良后果。例如,考虑图 8-5,该图展示了一个二手车销售网站,其中 Aaliyah 和 Bryce 两人同时试图购买同一辆车。购买一辆车需要两次数据库写入:网站上的列表需要更新以反映买家,并向买家发送销售发票。在图 8-5 的情况下,销售给了 Bryce(因为他执行了赢得更新 listings 表的交易),但发票发送给了 Aaliyah(因为她执行了赢得更新 invoices 表的交易)。读已提交可以防止此类意外。
  • However, read committed does not prevent the race condition between two counter increments in Figure 8-1. In this case, the second write happens after the first transaction has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in “Preventing Lost Updates” we will discuss how to make such counter increments safe.
    然而,可重复读并不能防止图 8-1 中两个计数器增量之间的竞争条件。在这种情况下,第二次写发生在第一个事务提交之后,所以它不是脏写。它仍然是不正确的,但原因不同 —— 在 “防止更新丢失” 中,我们将讨论如何使这种计数器增量操作安全。
    ddia 0805

Implementing read committed 实现读已提交#

Read committed is a very popular isolation level. It is the default setting in Oracle Database, PostgreSQL, SQL Server, and many other databases [10].
可重复读是一种非常流行的隔离级别。它是 Oracle 数据库、PostgreSQL、SQL Server 以及许多其他数据库的默认设置 [10]。

Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular row (or document or some other object), it must first acquire a lock on that row. It must then hold that lock until the transaction is committed or aborted. Only one transaction can hold the lock for any given row; if another transaction wants to write to the same row, it must wait until the first transaction is committed or aborted before it can acquire the lock and continue. This locking is done automatically by databases in read committed mode (or stronger isolation levels).
通常情况下,数据库通过行级锁来防止脏写:当事务想要修改特定的一行(或文档或其他对象)时,它必须首先获取该行的锁。然后,它必须持有该锁直到事务提交或中止。任何给定行只能由一个事务持有锁;如果另一个事务想要写入同一行,它必须等待第一个事务提交或中止后才能获取锁并继续。这种锁机制在读取提交模式(或更高隔离级别)下由数据库自动完成。

How do we prevent dirty reads? One option would be to use the same lock, and to require any transaction that wants to read a row to briefly acquire the lock and then release it again immediately after reading. This would ensure that a read couldn’t happen while a row has a dirty, uncommitted value (because during that time the lock would be held by the transaction that has made the write).
我们如何防止脏读?一个选项是使用相同的锁,并要求任何想要读取行的交易短暂地获取锁,然后在读取后立即再次释放它。这将确保在行具有脏的、未提交的值时不会发生读取(因为在那个时间,锁会被进行写入的交易持有)。

However, the approach of requiring read locks does not work well in practice, because one long-running write transaction can force many other transactions to wait until the long-running transaction has completed, even if the other transactions only read and do not write anything to the database. This harms the response time of read-only transactions and is bad for operability: a slowdown in one part of an application can have a knock-on effect in a completely different part of the application, due to waiting for locks.
然而,要求读取锁的方法在实际中效果不佳,因为一个长时间运行的写入事务可能会迫使许多其他事务等待,即使这些事务只是读取而不向数据库写入任何数据。这会损害只读事务的响应时间,并且对可操作性不利:应用程序某一部分的延迟可能会由于等待锁而影响到完全不同的另一部分。

Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM Db2 and Microsoft SQL Server in the read_committed_snapshot=off setting [30].
然而,在某些数据库中,如 IBM Db2 和 Microsoft SQL Server 在 read_committed_snapshot=off 设置 [30] 中,使用锁来防止脏读。

A more commonly used approach to preventing dirty reads is the one illustrated in Figure 8-4: for every row that is written, the database remembers both the old committed value and the new value set by the transaction that currently holds the write lock. While the transaction is ongoing, any other transactions that read the row are simply given the old value. Only when the new value is committed do transactions switch over to reading the new value (see “Multi-version concurrency control (MVCC)” for more detail).
一种更常用的防止脏读的方法如图 8-4 所示:对于每个被写入的行,数据库都记住旧提交的值和当前持有写锁的事务所设置的新值。在事务进行期间,任何读取该行的其他事务只是被给予旧值。只有当新值被提交时,事务才会切换到读取新值(有关更多细节,请参阅 “多版本并发控制(MVCC)”)。

Snapshot Isolation and Repeatable Read 快照隔离和可重复读#

If you look superficially at read committed isolation, you could be forgiven for thinking that it does everything that a transaction needs to do: it allows aborts (required for atomicity), it prevents reading the incomplete results of transactions, and it prevents concurrent writes from getting intermingled. Indeed, those are useful features, and much stronger guarantees than you can get from a system that has no transactions.
如果你对读已提交隔离了解不深,你可能会误认为那就是事务所要做的所有你可能会被原谅地认为它做了一切一个事务需要做的事情:它允许中止(原子性所必需的),它阻止读取事务的不完整结果,并且它阻止并发写入相互混合。确实,这些是有用的特性,而且比没有事务的系统能提供的更强保证。

However, there are still plenty of ways in which you can have concurrency bugs when using this isolation level. For example, Figure 8-6 illustrates a problem that can occur with read committed.
然而,在使用这种隔离级别时,仍然有很多方法可能导致并发问题。例如,图 8-6 说明了读已提交可能发生的问题。

ddia 0806

Say Aaliyah has 1,000ofsavingsatabank,splitacrosstwoaccountswith1,000 of savings at a bank, split across two accounts with 500 each. Now a transaction transfers 100fromoneofheraccountstotheother.Ifsheisunluckyenoughtolookatherlistofaccountbalancesinthesamemomentasthattransactionisbeingprocessed,shemayseeoneaccountbalanceatatimebeforetheincomingpaymenthasarrived(withabalanceof100 from one of her accounts to the other. If she is unlucky enough to look at her list of account balances in the same moment as that transaction is being processed, she may see one account balance at a time before the incoming payment has arrived (with a balance of 500), and the other account after the outgoing transfer has been made (the new balance being 400).ToAaliyahitnowappearsasthoughsheonlyhasatotalof400). To Aaliyah it now appears as though she only has a total of 900 in her accounts—it seems that $100 has vanished into thin air.
说 Aaliyah 在银行有两个账户,每个账户有 500 美元的存款,总共 1000 美元。现在有一笔交易将 100 美元从一个账户转移到另一个账户。如果她不幸在交易处理的同时查看她的账户余额列表,她可能会看到其中一个账户余额在收款到达前显示为 500 美元,而另一个账户在转账完成后显示为 400 美元。对 Aaliyah 来说,现在看起来她的账户总共有 900 美元 —— 似乎有 100 美元凭空消失了。

This anomaly is called read skew, and it is an example of a nonrepeatable read: if Aaliyah were to read the balance of account 1 again at the end of the transaction, she would see a different value (600)thanshesawinherpreviousquery.Readskewisconsideredacceptableunderreadcommittedisolation:theaccountbalancesthatAaliyahsawwereindeedcommittedatthetimewhenshereadthem. 这种异常被称为读倾斜,它是一个不可重复读的例子:如果Aaliyah在事务结束时再次读取账户1的余额,她会看到一个不同的值(600) than she saw in her previous query. Read skew is considered acceptable under read committed isolation: the account balances that Aaliyah saw were indeed committed at the time when she read them.\ 这种异常被称为读倾斜,它是一个不可重复读的例子:如果 Aaliyah 在事务结束时再次读取账户 1 的余额,她会看到一个不同的值(600),而不是她之前查询时看到的值。在读取提交隔离级别下,读倾斜被认为是可接受的:Aaliyah 看到的账户余额确实是在她读取它们的时候提交的。

Note 注意#

The term skew is unfortunately overloaded: we previously used it in the sense of an unbalanced workload with hot spots (see “Skewed Workloads and Relieving Hot Spots”), whereas here it means timing anomaly.
术语 “skew” 不幸地被多重使用:我们之前用它表示不平衡的工作负载和热点(参见 “倾斜工作负载和缓解热点”),而在这里它指的是时间异常。

In Aaliyah’s case, this is not a lasting problem, because she will most likely see consistent account balances if she reloads the online banking website a few seconds later. However, some situations cannot tolerate such temporary inconsistency:
在 Aaliyah 的情况下,这不是一个持续的问题,因为她稍后重新加载在线银行网站时,很可能会看到一致的账户余额。然而,有些情况无法容忍这种临时不一致性:

Backups 备份

Taking a backup requires making a copy of the entire database, which may take hours on a large database. During the time that the backup process is running, writes will continue to be made to the database. Thus, you could end up with some parts of the backup containing an older version of the data, and other parts containing a newer version. If you need to restore from such a backup, the inconsistencies (such as disappearing money) become permanent.
备份需要复制整个数据库,对于大型数据库来说可能需要数小时。在备份过程运行期间,数据库会继续接收写入操作。因此,你可能会发现备份的不同部分包含不同版本的数据:一部分是旧版本,另一部分是新版本。如果你需要从这样的备份中恢复数据,不一致性(例如消失的钱)将永久存在。

Analytic queries and integrity checks
分析查询和完整性检查

Sometimes, you may want to run a query that scans over large parts of the database. Such queries are common in analytics (see “Analytical versus Operational Systems”), or may be part of a periodic integrity check that everything is in order (monitoring for data corruption). These queries are likely to return nonsensical results if they observe parts of the database at different points in time.
有时,你可能需要运行一个扫描数据库大部分区域的查询。这类查询常见于分析(参见 “分析型与操作型系统”),或可能是周期性完整性检查的一部分,以确保一切正常(监控数据损坏)。如果这些查询在不同时间点观察到数据库的不同部分,它们很可能会返回无意义的结果。

Snapshot isolation [37] is the most common solution to this problem. The idea is that each transaction reads from a consistent snapshot of the database—that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that particular point in time.
快照隔离 [37] 是解决这个问题的最常见方案。其思路是每个事务都从数据库的一致性快照中读取 —— 也就是说,事务能看到事务开始时数据库中已提交的所有数据。即使数据随后被其他事务更改,每个事务也只看到该特定时间点的旧数据。

Snapshot isolation is a boon for long-running, read-only queries such as backups and analytics. It is very hard to reason about the meaning of a query if the data on which it operates is changing at the same time as the query is executing. When a transaction can see a consistent snapshot of the database, frozen at a particular point in time, it is much easier to understand.
快照隔离对长时间运行的只读查询(如备份和分析)非常有利。如果查询执行时操作的数据也在变化,很难推断查询的含义。当事务能看到数据库在特定时间点的快照时,理解起来会容易得多。

Snapshot isolation is a popular feature: variants of it are supported by PostgreSQL, MySQL with the InnoDB storage engine, Oracle, SQL Server, and others, although the detailed behavior varies from one system to the next [30,41,42]. Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their highest isolation level.
快照隔离是一个流行的功能:它的不同变体被 PostgreSQL、使用 InnoDB 存储引擎的 MySQL、Oracle、SQL Server 等系统支持,尽管每个系统的具体行为有所不同 [30, 41, 42]。一些数据库,如 Oracle、TiDB 和 Aurora DSQL,甚至将快照隔离作为其最高的隔离级别。

Multi-version concurrency control (MVCC) 多版本并发控制#

Like read committed isolation, implementations of snapshot isolation typically use write locks to prevent dirty writes (see “Implementing read committed”), which means that a transaction that makes a write can block the progress of another transaction that writes to the same row. However, reads do not require any locks. From a performance point of view, a key principle of snapshot isolation is readers never block writers, and writers never block readers. This allows a database to handle long-running read queries on a consistent snapshot at the same time as processing writes normally, without any lock contention between the two.
与读已提交隔离类似,快照隔离的实现通常使用写锁来防止脏读(参见 “实现读已提交”),这意味着一个进行写入的事务可能会阻塞另一个写入相同行的交易。然而,读取操作不需要任何锁。从性能角度来看,快照隔离的一个关键原则是读者从不阻塞写者,写者也从不阻塞读者。这使得数据库能够在处理正常写入的同时,对一致性的快照执行长时间运行的读取查询,而两者之间没有任何锁竞争。

To implement snapshot isolation, databases use a generalization of the mechanism we saw for preventing dirty reads in Figure 8-4. Instead of two versions of each row (the committed version and the overwritten-but-not-yet-committed version), the database must potentially keep several different committed versions of a row, because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of a row side by side, this technique is known as multi-version concurrency control (MVCC).
为了实现快照隔离,数据库使用了一种我们曾在图 8-4 中看到的防止脏读机制的泛化。数据库需要保持一个行可能存在的多个不同已提交版本,因为多个进行中的事务可能需要在不同的时间点看到数据库的状态。由于它并排维护一个行的多个版本,这种技术被称为多版本并发控制(MVCC)。

Figure 8-7 illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL [41,43,44] (other implementations are similar). When a transaction is started, it is given a unique, always-increasing transaction ID (txid). Whenever a transaction writes anything to the database, the data it writes is tagged with the transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so they overflow after approximately 4 billion transactions. The vacuum process performs cleanup to ensure that overflow does not affect the data.)
图 8-7 说明了 PostgreSQL 中基于 MVCC 的快照隔离的实现方式 [41, 43, 44](其他实现方式类似)。当启动一个事务时,系统会为其分配一个唯一且持续递增的事务 ID( txid )。每当事务向数据库写入数据时,写入的数据都会被标记上写入者的事务 ID。(精确来说,PostgreSQL 中的事务 ID 是 32 位整数,大约在 40 亿个事务后会发生溢出。清理进程会进行清理工作,以确保溢出不会影响数据。)

ddia 0807

Each row in a table has a inserted_by field, containing the ID of the transaction that inserted this row into the table. Moreover, each row has a deleted_by field, which is initially empty. If a transaction deletes a row, the row isn’t actually removed from the database, but it is marked for deletion by setting the deleted_by field to the ID of the transaction that requested the deletion. At some later time, when it is certain that no transaction can any longer access the deleted data, a garbage collection process in the database removes any rows marked for deletion and frees their space.
表中的每一行都有一个 inserted_by 字段,包含将此行插入表中的事务 ID。此外,每一行都有一个 deleted_by 字段,初始为空。如果事务删除一行,该行实际上不会被从数据库中删除,而是通过将 deleted_by 字段设置为请求删除的事务 ID 来标记为删除。在某个稍后的时间点,当确定没有事务可以再访问被删除的数据时,数据库中的垃圾回收过程会删除所有标记为删除的行并释放它们的空间。

An update is internally translated into a delete and a insert [45]. For example, in Figure 8-7, transaction 13 deducts 100fromaccount2,changingthebalancefrom100 from account 2, changing the balance from 500 to 400.Theaccountstablenowactuallycontainstworowsforaccount2:arowwithabalanceof400. The `accounts` table now actually contains two rows for account 2: a row with a balance of 500 which was marked as deleted by transaction 13, and a row with a balance of $400 which was inserted by transaction 13.
更新操作在内部被转换为删除和插入操作 [45]。例如,在图 8-7 中,事务 13 从账户 2 扣除 100 美元,将余额从 500 美元变为 400 美元。 accounts 表现在实际上包含账户 2 的两行:一行余额为 500 美元,被事务 13 标记为删除;另一行余额为 400 美元,由事务 13 插入。

All of the versions of a row are stored within the same database heap (see “Storing values within the index”), regardless of whether the transactions that wrote them have committed or not. The versions of the same row form a linked list, going either from newest version to oldest version or the other way round, so that queries can internally iterate over all versions of a row [46,47].
同一行的所有版本都存储在同一个数据库堆中(参见 “在索引中存储值”),无论写入它们的交易是否已提交。同一行的版本形成一个链表,从最新版本到最旧版本或相反方向,以便查询可以内部迭代所有行的版本 [46,47]。

Visibility rules for observing a consistent snapshot 观察一致性快照的可见性规则#

When a transaction reads from the database, transaction IDs are used to decide which row versions it can see and which are invisible. By carefully defining visibility rules, the database can present a consistent snapshot of the database to the application. This works roughly as follows [44]:
当事务从数据库读取时,使用事务 ID 来决定它能看到哪些行版本以及哪些是不可见的。通过仔细定义可见性规则,数据库可以向应用程序呈现一致的数据库快照。这大致工作如下 [44]:

  1. At the start of each transaction, the database makes a list of all the other transactions that are in progress (not yet committed or aborted) at that time. Any writes that those transactions have made are ignored, even if the transactions subsequently commit. This ensures that we see a consistent snapshot that is not affected by another transaction committing.
    每个事务开始时,数据库会列出当时正在进行中的所有其他事务(尚未提交或中止)。这些事务所做的任何写入操作都会被忽略,即使这些事务随后提交了。这确保我们看到的是一个一致的快照,不受其他事务提交的影响。
  2. Any writes made by transactions with a later transaction ID (i.e., which started after the current transaction started, and which are therefore not included in the list of in-progress transactions) are ignored, regardless of whether those transactions have committed.
    任何由具有较晚事务 ID 的事务(即,在当前事务开始后才启动的事务,因此不包含在正在进行中的事务列表中)所做的写入操作都会被忽略,无论这些事务是否已提交。
  3. Any writes made by aborted transactions are ignored, regardless of when that abort happened. This has the advantage that when a transaction aborts, we don’t need to immediately remove the rows it wrote from storage, since the visibility rule filters them out. The garbage collection process can remove them later.
    由中止事务产生的任何写入都会被忽略,无论中止发生在何时。这有一个优点,即当事务中止时,我们不需要立即从存储中删除它写入的行,因为可见性规则会过滤掉它们。垃圾回收过程可以在稍后删除它们。
  4. All other writes are visible to the application’s queries.
    所有其他写入对应用程序的查询都是可见的。

These rules apply to both insertion and deletion of rows. In Figure 8-7, when transaction 12 reads from account 2, it sees a balance of 500becausethedeletionofthe500 because the deletion of the 500 balance was made by transaction 13 (according to rule 2, transaction 12 cannot see a deletion made by transaction 13), and the insertion of the $400 balance is not yet visible (by the same rule).
这些规则适用于行的插入和删除。在图 8-7 中,当事务 12 从账户 2 读取时,它看到 500 美元的余额,因为 500 美元的余额删除是由事务 13 完成的(根据规则 2,事务 12 不能看到事务 13 所做的删除),而 400 美元的余额插入尚未可见(根据同样的规则)。

Put another way, a row is visible if both of the following conditions are true:
换句话说,如果以下两个条件都为真,则行是可见的:

  • At the time when the reader’s transaction started, the transaction that inserted the row had already committed.
    在读取者的事务开始时,插入该行的交易已经提交。
  • The row is not marked for deletion, or if it is, the transaction that requested deletion had not yet committed at the time when the reader’s transaction started.
    该行没有被标记为删除,或者如果被标记了,请求删除的事务在读取者的事务开始时还没有提交。

A long-running transaction may continue using a snapshot for a long time, continuing to read values that (from other transactions’ point of view) have long been overwritten or deleted. By never updating values in place but instead inserting a new version every time a value is changed, the database can provide a consistent snapshot while incurring only a small overhead.
一个长时间运行的事务可能会长时间使用一个快照,继续读取那些(从其他事务的角度来看)早已被覆盖或删除的值。通过从不就地更新值,而是在每次值发生变化时插入一个新版本,数据库可以在只产生很小开销的情况下提供一致的快照。

Indexes and snapshot isolation 索引和快照隔离#

How do indexes work in a multi-version database? The most common approach is that each index entry points at one of the versions of a row that matches the entry (either the oldest or the newest version). Each row version may contain a reference to the next-oldest or next-newest version. A query that uses the index must then iterate over the rows to find one that is visible, and where the value matches what the query is looking for. When garbage collection removes old row versions that are no longer visible to any transaction, the corresponding index entries can also be removed.
多版本数据库中的索引是如何工作的?最常见的方法是每个索引条目指向匹配该条目的某一行版本(要么是最旧的版本,要么是最新的版本)。每一行版本可能包含对下一个最旧或下一个最新版本的引用。使用索引的查询必须遍历行以找到可见且值与查询查找内容匹配的行。当垃圾回收移除不再对任何事务可见的旧行版本时,相应的索引条目也可以被移除。

Many implementation details affect the performance of multi-version concurrency control [46, 47]. For example, PostgreSQL has optimizations for avoiding index updates if different versions of the same row can fit on the same page [41]. Some other databases avoid storing full copies of modified rows, and only store differences between versions to save space.
许多实现细节会影响多版本并发控制的性能 [46,47]。例如,PostgreSQL 有优化措施,以避免在相同页面上可以容纳同一行的不同版本时进行索引更新 [41]。其他一些数据库避免存储修改行的完整副本,而只存储版本之间的差异以节省空间。

Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B-trees (see “B-Trees”), they use an immutable (copy-on-write) variant that does not overwrite pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their child pages. Any pages that are not affected by a write do not need to be copied, and can be shared with the new tree [48].
另一种方法在 CouchDB、Datomic 和 LMDB 中使用。尽管它们也使用 B 树(见 “B 树”),但它们使用的是不可变(写时复制)的变体,在更新树页时不会覆盖页面,而是为每个修改的页面创建一个新的副本。父页面,直到树的根,会被复制并更新以指向其子页面的新版本。任何未受写入影响的页面不需要被复制,并且可以与新的树共享 [48]。

With immutable B-trees, every write transaction (or batch of transactions) creates a new B-tree root, and a particular root is a consistent snapshot of the database at the point in time when it was created. There is no need to filter out rows based on transaction IDs because subsequent writes cannot modify an existing B-tree; they can only create new tree roots. This approach also requires a background process for compaction and garbage collection.
对于不可变的 B 树,每次写事务(或事务批处理)都会创建一个新的 B 树根节点,而特定的根节点是它在创建时数据库状态的快照。无需根据事务 ID 过滤行,因为后续的写操作无法修改现有的 B 树;它们只能创建新的树根。这种方法还需要一个后台进程来进行压缩和垃圾回收。

Snapshot isolation, repeatable read, and naming confusion 快照隔离、可重复读和命名混淆#

MVCC is a commonly used implementation technique for databases, and often it is used to implement snapshot isolation. However, different databases sometimes use different terms to refer to the same thing: for example, snapshot isolation is called “repeatable read” in PostgreSQL, and “serializable” in Oracle [30]. Sometimes different systems use the same term to mean different things: for example, while in PostgreSQL “repeatable read” means snapshot isolation, in MySQL it means an implementation of MVCC with weaker consistency than snapshot isolation [42].
MVCC 是一种常用于数据库的实现技术,通常用于实现快照隔离。然而,不同的数据库有时会用不同的术语来指代同一事物:例如,快照隔离在 PostgreSQL 中被称为 “可重复读”,在 Oracle 中被称为 “可串行化”[30]。有时不同的系统会用相同的术语来表示不同的事物:例如,虽然 PostgreSQL 中的 “可重复读” 表示快照隔离,但在 MySQL 中它表示一种比快照隔离一致性更弱的 MVCC 实现 [42]。

The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot isolation, because the standard is based on System R’s 1975 definition of isolation levels [3] and snapshot isolation hadn’t yet been invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot isolation. PostgreSQL calls its snapshot isolation level “repeatable read” because it meets the requirements of the standard, and so they can claim standards compliance.
这种命名混淆的原因是 SQL 标准没有快照隔离的概念,因为该标准基于 System R 在 1975 年对隔离级别的定义 [3],当时快照隔离尚未发明。相反,它定义了可重复读,表面上看与快照隔离相似。PostgreSQL 将其快照隔离级别称为 “可重复读”,因为它满足了标准的要求,因此他们可以声称符合标准。

Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous, imprecise, and not as implementation-independent as a standard should be [37]. Even though several databases implement repeatable read, there are big differences in the guarantees they actually provide, despite being ostensibly standardized [30]. There has been a formal definition of repeatable read in the research literature [38,39], but most implementations don’t satisfy that formal definition. And to top it off, IBM Db2 uses “repeatable read” to refer to serializability [10].
不幸的是,SQL 标准的隔离级别定义存在缺陷 —— 它模糊不清、不够精确,并且不像标准应有的那样独立于实现 [37]。尽管一些数据库实现了可重复读,但它们实际提供的保证存在巨大差异,尽管表面上已标准化 [30]。研究文献中对可重复读有正式定义 [38,39],但大多数实现都不满足该正式定义。更糟糕的是,IBM Db2 将 “可重复读” 用于指代可串行化 [10]。

As a result, nobody really knows what repeatable read means.
因此,没有人真正知道可重复读是什么意思。

Preventing Lost Updates 防止更新丢失#

The read committed and snapshot isolation levels we’ve discussed so far have been primarily about the guarantees of what a read-only transaction can see in the presence of concurrent writes. We have mostly ignored the issue of two transactions writing concurrently—we have only discussed dirty writes (see “No dirty writes”), one particular type of write-write conflict that can occur.
我们之前讨论的读已提交和快照隔离级别主要关注的是在并发写入的情况下,只读事务能看到什么保证。我们基本上忽略了两个事务并发写入的问题 —— 我们只讨论了脏写(参见 “不允许脏写”),这是一种特定的写写冲突类型。

There are several other interesting kinds of conflicts that can occur between concurrently writing transactions. The best known of these is the lost update problem, illustrated in Figure 8-1 with the example of two concurrent counter increments.
同时写入的事务之间可能会发生其他几种有趣的冲突。其中最著名的是丢失更新问题,图 8-1 通过两个并发计数器递增的例子进行了说明。

The lost update problem can occur if an application reads some value from the database, modifies it, and writes back the modified value (a read-modify-write cycle). If two transactions do this concurrently, one of the modifications can be lost, because the second write does not include the first modification. (We sometimes say that the later write clobbers the earlier write.) This pattern occurs in various different scenarios:
如果应用程序从数据库中读取某个值、修改它并写回修改后的值(一个读 - 改 - 写循环),可能会发生丢失更新问题。如果两个事务同时这样做,其中一个修改可能会丢失,因为第二次写操作不包括第一次修改。(我们有时会说,后来的写操作覆盖了先前的写操作。)这种模式在各种不同的场景中都会出现:

  • Incrementing a counter or updating an account balance (requires reading the current value, calculating the new value, and writing back the updated value)
    递增计数器或更新账户余额(需要读取当前值、计算新值并写回更新后的值)
  • Making a local change to a complex value, e.g., adding an element to a list within a JSON document (requires parsing the document, making the change, and writing back the modified document)
    对复杂值进行本地更改,例如在 JSON 文档中的列表内添加元素(需要解析文档、进行更改并写回修改后的文档)
  • Two users editing a wiki page at the same time, where each user saves their changes by sending the entire page contents to the server, overwriting whatever is currently in the database
    两个用户同时编辑同一篇维基页面,每个用户通过将整个页面内容发送到服务器来保存他们的更改,覆盖数据库中当前的内容

Because this is such a common problem, a variety of solutions have been developed [49].
由于这是一个非常普遍的问题,已经开发出各种解决方案 [49]。

Atomic write operations 原子写入操作#

Many databases provide atomic update operations, which remove the need to implement read-modify-write cycles in application code. They are usually the best solution if your code can be expressed in terms of those operations. For example, the following instruction is concurrency-safe in most relational databases:
许多数据库提供原子更新操作,这消除了在应用程序代码中实现读 - 改 - 写循环的需要。如果你的代码可以用这些操作来表示,它们通常是最佳解决方案。例如,以下指令在大多数关系型数据库中是并发安全的:

Similarly, document databases such as MongoDB provide atomic operations for making local modifications to a part of a JSON document, and Redis provides atomic operations for modifying data structures such as priority queues. Not all writes can easily be expressed in terms of atomic operations—for example, updates to a wiki page involve arbitrary text editing, which can be handled using algorithms discussed in “CRDTs and Operational Transformation” —but in situations where atomic operations can be used, they are usually the best choice.
类似地,MongoDB 等文档数据库为对 JSON 文档的部分进行本地修改提供了原子操作,而 Redis 为修改优先队列等数据结构提供了原子操作。并非所有写入都能轻易用原子操作来表达 —— 例如,对维基页面的更新涉及任意文本编辑,这可以使用 “CRDTs 和操作转换” 中讨论的算法来处理 —— 但在可以使用原子操作的情况下,它们通常是最佳选择。

Atomic operations are usually implemented by taking an exclusive lock on the object when it is read so that no other transaction can read it until the update has been applied. Another option is to simply force all atomic operations to be executed on a single thread.
原子操作通常通过在读取对象时对其加独占锁来实现,这样其他事务就不能读取它,直到更新已应用。另一个选项是强制所有原子操作都在单个线程上执行。

Unfortunately, object-relational mapping (ORM) frameworks make it easy to accidentally write code that performs unsafe read-modify-write cycles instead of using atomic operations provided by the database [50,51,52]. This can be a source of subtle bugs that are difficult to find by testing.
不幸的是,对象关系映射(ORM)框架容易让人不小心写出执行不安全读 - 改 - 写循环的代码,而不是使用数据库提供的原子操作 [50, 51, 52]。这可能是导致难以通过测试发现的微妙错误的来源。

Explicit locking 显式锁定#

Another option for preventing lost updates, if the database’s built-in atomic operations don’t provide the necessary functionality, is for the application to explicitly lock objects that are going to be updated. Then the application can perform a read-modify-write cycle, and if any other transaction tries to concurrently update or lock the same object, it is forced to wait until the first read-modify-write cycle has completed.
如果数据库的内置原子操作不提供必要的功能,防止更新丢失的另一种选择是应用程序显式锁定将要更新的对象。然后应用程序可以执行读 - 修改 - 写周期,如果任何其他事务尝试同时更新或锁定同一个对象,它被迫等待第一个读 - 修改 - 写周期完成。

For example, consider a multiplayer game in which several players can move the same figure concurrently. In this case, an atomic operation may not be sufficient, because the application also needs to ensure that a player’s move abides by the rules of the game, which involves some logic that you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two players from concurrently moving the same piece, as illustrated in Example 8-1.
例如,考虑一个多人游戏,其中多个玩家可以同时移动同一个角色。在这种情况下,原子操作可能不够用,因为应用程序还需要确保玩家的移动符合游戏规则,这涉及到一些不能合理地作为数据库查询实现的逻辑。相反,你可以使用锁来防止两个玩家同时移动同一个棋子,如示例 8-1 所示。

Example 8-1. Explicitly locking rows to prevent lost updates 示例 8-1. 显式锁定行以防止更新丢失#

The FOR UPDATE clause indicates that the database should take a lock on all rows returned by this query.
FOR UPDATE 子句表示数据库应该对查询返回的所有行加锁。

This works, but to get it right, you need to carefully think about your application logic. It’s easy to forget to add a necessary lock somewhere in the code, and thus introduce a race condition.
这种方法可行,但要正确实现,需要仔细思考应用逻辑。很容易忘记在代码的某个地方添加必要的锁,从而引入竞争条件。

Moreover, if you lock multiple objects there is a risk of deadlock, where two or more transactions are waiting for each other to release their locks. Many databases automatically detect deadlocks, and abort one of the involved transactions so that the system can make progress. You can handle this situation at the application level by retrying the aborted transaction.
此外,如果你锁定多个对象,存在死锁的风险,即两个或多个事务互相等待对方释放锁。许多数据库会自动检测死锁,并中止其中一个涉及的事务,以便系统可以继续运行。你可以在应用层通过重试中止的事务来处理这种情况。

Automatically detecting lost updates 自动检测丢失更新#

Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle.
原子操作和锁是通过强制读 - 改 - 写循环顺序发生来防止丢失更新的方法。另一种方法是允许它们并行执行,如果事务管理器检测到丢失更新,则中止事务并强制其重试读 - 改 - 写循环。

An advantage of this approach is that databases can perform this check efficiently in conjunction with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates [30,42]. Some authors [37,39] argue that a database must prevent lost updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot isolation under this definition.
这种方法的优点是数据库可以结合快照隔离高效地执行此检查。确实,PostgreSQL 的重复读、Oracle 的可序列化以及 SQL Server 的快照隔离级别会自动检测丢失更新何时发生并中止引发问题的事务。然而,MySQL / InnoDB 的重复读不会检测丢失更新 [30, 42]。一些作者 [37, 39] 认为,数据库必须防止丢失更新才能算作提供快照隔离,因此根据这一定义,MySQL 不提供快照隔离。

Lost update detection is a great feature, because it doesn’t require application code to use any special database features—you may forget to use a lock or an atomic operation and thus introduce a bug, but lost update detection happens automatically and is thus less error-prone. However, you also have to retry aborted transactions at the application level.
更新丢失检测是一个很棒的功能,因为它不需要应用程序代码使用任何特殊的数据库功能 —— 你可能忘记使用锁或原子操作从而引入一个错误,但更新丢失检测会自动发生,因此出错的可能性较低。然而,你也必须在应用程序级别重试中止的事务。

Conditional writes (compare-and-set) 条件写入(比较并设置)#

In databases that don’t provide transactions, you sometimes find a conditional write operation that can prevent lost updates by allowing an update to happen only if the value has not changed since you last read it (previously mentioned in “Single-object writes”). If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried. It is the database equivalent of an atomic compare-and-set or compare-and-swap (CAS) instruction that is supported by many CPUs.
在那些不提供事务的数据库中,你有时会发现一种条件写入操作,它可以通过允许仅在值自上次读取以来未发生变化时才执行更新来防止更新丢失(此前在 “单对象写入” 中提到过)。如果当前值与你之前读取的不匹配,则更新不会生效,并且必须重试读 - 修改 - 写周期。这是许多 CPU 支持的原子比较并设置或比较并交换(CAS)指令的数据库等效操作。

For example, to prevent two users concurrently updating the same wiki page, you might try something like this, expecting the update to occur only if the content of the page hasn’t changed since the user started editing it:
例如,为了防止两个用户同时更新同一篇维基页面,您可以尝试以下方法,期望更新仅在页面内容自用户开始编辑以来未发生变化时才发生:

If the content has changed and no longer matches 'old content', this update will have no effect, so you need to check whether the update took effect and retry if necessary. Instead of comparing the full content, you could also use a version number column that you increment on every update, and apply the update only if the current version number hasn’t changed. This approach is sometimes called optimistic locking [53].
如果内容已更改且不再匹配 'old content' ,此更新将无效,因此您需要检查更新是否生效并在必要时重试。除了比较完整内容外,您还可以使用一个版本号列,每次更新时递增,并且只有当当前版本号未更改时才应用更新。这种方法有时被称为乐观锁 [53]。

Note that if another transaction has concurrently modified content, the new content may not be visible under the MVCC visibility rules (see “Visibility rules for observing a consistent snapshot”). Many implementations of MVCC have an exception to the visibility rules for this scenario, where values written by other transactions are visible to the evaluation of the WHERE clause of UPDATE and DELETE queries, even though those writes are not otherwise visible in the snapshot.
请注意,如果另一个事务同时修改了 content ,根据 MVCC 可见性规则(参见 “观察一致性快照的可见性规则”),新内容可能不可见。许多 MVCC 实现对此场景的可见性规则有例外,即其他事务写入的值对 UPDATEDELETE 查询的 WHERE 子句的评估可见,尽管这些写入在快照中不可见。

Conflict resolution and replication 冲突解决与复制#

In replicated databases (see Chapter 6), preventing lost updates takes on another dimension: since they have copies of the data on multiple nodes, and the data can potentially be modified concurrently on different nodes, some additional steps need to be taken to prevent lost updates.
在复制的数据库(见第 6 章)中,防止数据更新丢失呈现出另一个维度:由于它们在多个节点上都有数据副本,并且数据可能在不同的节点上同时被修改,因此需要采取一些额外的步骤来防止数据更新丢失。

Locks and conditional write operations assume that there is a single up-to-date copy of the data. However, databases with multi-leader or leaderless replication usually allow several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Thus, techniques based on locks or conditional writes do not apply in this context. (We will revisit this issue in more detail in “Linearizability”.)
锁和条件写入操作假设存在一个最新数据副本。然而,具有多主或无主复制的数据库通常允许多个写入操作同时发生并异步复制,因此它们无法保证存在一个最新数据副本。因此,基于锁或条件写入的技术在这种情况下不适用。(我们将在 “线性化” 部分更详细地讨论这个问题。)

Instead, as discussed in “Dealing with Conflicting Writes”, a common approach in such replicated databases is to allow concurrent writes to create several conflicting versions of a value (also known as siblings), and to use application code or special data structures to resolve and merge these versions after the fact.
相反,正如在 “处理冲突写入” 中讨论的那样,在这样的事务数据库中,一种常见的方法是允许并发写入以创建多个冲突版本的价值(也称为兄弟),并使用应用程序代码或特殊数据结构事后解决和合并这些版本。

Merging conflicting values can prevent lost updates if the updates are commutative (i.e., you can apply them in a different order on different replicas, and still get the same result). For example, incrementing a counter or adding an element to a set are commutative operations. That is the idea behind CRDTs, which we encountered in “CRDTs and Operational Transformation”. However, some operations such as conditional writes cannot be made commutative.
合并冲突值可以防止更新丢失,如果这些更新是可交换的(即,你可以在不同的副本上以不同的顺序应用它们,仍然得到相同的结果)。例如,递增计数器或向集合中添加元素都是可交换的操作。这就是 CRDTs 背后的思想,我们在 “CRDTs 和操作转换” 中遇到过 CRDTs。然而,某些操作(如条件写入)无法使其可交换。

On the other hand, the last write wins (LWW) conflict resolution method is prone to lost updates, as discussed in “Last write wins (discarding concurrent writes)”. Unfortunately, LWW is the default in many replicated databases.
另一方面,最后写入者胜出(LWW)的冲突解决方法容易导致丢失更新,正如在 “最后写入者胜出(丢弃并发写入)” 中讨论的那样。不幸的是,LWW 是许多复用数据库的默认设置。

Write Skew and Phantoms 写入偏差和幻影#

In the previous sections we saw dirty writes and lost updates, two kinds of race conditions that can occur when different transactions concurrently try to write to the same objects. In order to avoid data corruption, those race conditions need to be prevented—either automatically by the database, or by manual safeguards such as using locks or atomic write operations.
在前几节中,我们看到了脏写和丢失更新,这两种竞争条件是在不同事务尝试同时写入相同对象时可能发生的。为了防止数据损坏,这些竞争条件需要被避免 —— 无论是数据库自动防止,还是通过手动保护措施,如使用锁或原子写入操作。

However, that is not the end of the list of potential race conditions that can occur between concurrent writes. In this section we will see some subtler examples of conflicts.
然而,这并非并发写入可能出现的潜在竞争条件的全部。在本节中,我们将看到一些更微妙冲突的例子。

To begin, imagine this example: you are writing an application for doctors to manage their on-call shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if they are sick themselves), provided that at least one colleague remains on call in that shift [54,55].
首先,想象这样一个例子:你正在为医生编写一个管理医院值班轮班的应用程序。医院通常希望同时有多名医生值班,但绝对必须至少有一名医生值班。医生可以放弃他们的轮班(例如,如果他们自己生病了),前提是同一轮班中至少还有一位同事在值班 [54, 55]。

Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button to go off call at approximately the same time. What happens next is illustrated in Figure 8-8.
现在假设 Aaliyah 和 Bryce 是某个特定轮次的两名值班医生。他们都感到不舒服,因此都决定请休假。不幸的是,他们恰好几乎同时点击了离岗按钮。接下来发生的情况如图 8-8 所示。

ddia 0808

In each transaction, your application first checks that two or more doctors are currently on call; if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot isolation, both checks return 2, so both transactions proceed to the next stage. Aaliyah updates her own record to take herself off call, and Bryce updates his own record likewise. Both transactions commit, and now no doctor is on call. Your requirement of having at least one doctor on call has been violated.
在每笔交易中,您的应用程序首先检查目前是否有两位或更多医生在值班;如果有的话,它就认为让一位医生离岗是安全的。由于数据库使用快照隔离,两个检查都返回 2 ,因此两个交易都进入下一阶段。Aaliyah 更新自己的记录以离岗,Bryce 也以同样的方式更新自己的记录。两个交易都提交了,现在没有医生在值班。您的至少需要有一位医生在值班的条件已经被违反。

Characterizing write skew 描述写偏差#

This anomaly is called write skew [37]. It is neither a dirty write nor a lost update, because the two transactions are updating two different objects (Aaliyah’s and Bryce’s on-call records, respectively). It is less obvious that a conflict occurred here, but it’s definitely a race condition: if the two transactions had run one after another, the second doctor would have been prevented from going off call. The anomalous behavior was only possible because the transactions ran concurrently.
这种异常被称为写倾斜 [37]。它既不是脏写,也不是更新丢失,因为这两个事务正在更新两个不同的对象(分别是 Aaliyah 和 Bryce 的值班记录)。这里发生的冲突并不那么明显,但它确实是一种竞态条件:如果这两个事务是按顺序运行的,第二个医生将无法轮班。这种异常行为之所以可能发生,是因为事务是并发运行的。

You can think of write skew as a generalization of the lost update problem. Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing).
你可以将写倾斜视为更新丢失问题的泛化。如果两个事务读取相同的对象,然后更新其中一些对象(不同的事务可能更新不同的对象),就会发生写倾斜。在特殊情况下,如果不同的事务更新同一个对象,就会产生脏写或更新丢失异常(具体取决于时间点)。

We saw that there are various different ways of preventing lost updates. With write skew, our options are more restricted:
我们看到,防止更新丢失有多种方法。但在写倾斜的情况下,我们的选择更加受限:

  • Atomic single-object operations don’t help, as multiple objects are involved.
    原子单对象操作没有帮助,因为涉及多个对象。
  • The automatic detection of lost updates that you find in some implementations of snapshot isolation unfortunately doesn’t help either: write skew is not automatically detected in PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL Server’s snapshot isolation level [30]. Automatically preventing write skew requires true serializable isolation (see “Serializability”).
    在一些实现快照隔离的系统中,自动检测丢失更新的功能也并不奏效:在 PostgreSQL 的可重复读、MySQL / InnoDB 的可重复读、Oracle 的可序列化级别或 SQL Server 的快照隔离级别中,写偏斜并未自动检测到 [30]。自动防止写偏斜需要真正的可序列化隔离(参见 “可序列化性”)。
  • Some databases allow you to configure constraints, which are then enforced by the database (e.g., uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to specify that at least one doctor must be on call, you would need a constraint that involves multiple objects. Most databases do not have built-in support for such constraints, but you may be able to implement them with triggers or materialized views, as discussed in “Consistency” [12].
    一些数据库允许你配置约束,这些约束随后由数据库强制执行(例如,唯一性、外键约束或特定值的限制)。然而,为了指定至少有一位医生必须值班,你需要一个涉及多个对象的约束。大多数数据库没有内置支持此类约束,但你可以使用触发器或物化视图来实现它们,如 “一致性”[12] 中所述。
  • If you can’t use a serializable isolation level, the second-best option in this case is probably to explicitly lock the rows that the transaction depends on. In the doctors example, you could write something like the following:
    如果你不能使用可序列化的隔离级别,在这种情况下,第二好的选择可能是显式锁定事务所依赖的行。在医生示例中,你可以编写类似以下的内容:
    As before, FOR UPDATE tells the database to lock all rows returned by this query.
    和之前一样, FOR UPDATE 告诉数据库锁定这个查询返回的所有行。

More examples of write skew 更多写倾斜的示例#

Write skew may seem like an esoteric issue at first, but once you’re aware of it, you may notice more situations in which it can occur. Here are some more examples:
写倾斜问题最初可能看起来像是一个晦涩的问题,但一旦你意识到它,你可能会注意到更多可能发生这种情况的情况。这里有一些更多的例子:

Meeting room booking system
会议室预订系统

Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [56]. When someone wants to make a booking, you first check for any conflicting bookings (i.e., bookings for the same room with an overlapping time range), and if none are found, you create the meeting (see Example 8-2).
假设你想强制确保同一时间不能有两个相同的会议室预订 [56]。当有人想要预订时,你首先检查是否存在任何冲突的预订(即同一房间有重叠时间范围的预订),如果没有找到,就创建会议(见示例 8-2)。

Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation) 示例 8-2. 一个会议室预订系统试图避免重复预订(在快照隔离下不安全)#

Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable isolation.
不幸的是,快照隔离并不能阻止另一个用户同时插入一个冲突的会议。为了确保你不会遇到调度冲突,你再次需要可序列化隔离。

Multiplayer game 多人游戏

In Example 8-1, we used a lock to prevent lost updates (that is, making sure that two players can’t move the same figure at the same time). However, the lock doesn’t prevent players from moving two different figures to the same position on the board or potentially making some other move that violates the rules of the game. Depending on the kind of rule you are enforcing, you might be able to use a unique constraint, but otherwise you’re vulnerable to write skew.
在示例 8-1 中,我们使用锁来防止丢失更新(即确保两个玩家不能同时移动同一个棋子)。然而,锁并不能阻止玩家将两个不同的棋子移动到棋盘上的同一位置,或者可能做出其他违反游戏规则的操作。根据你正在执行的游戏规则类型,你可能可以使用唯一约束,否则你将容易受到写倾斜的影响。

Claiming a username 声明用户名

On a website where each user has a unique username, two users may try to create accounts with the same username at the same time. You may use a transaction to check whether a name is taken and, if not, create an account with that name. However, like in the previous examples, that is not safe under snapshot isolation. Fortunately, a unique constraint is a simple solution here (the second transaction that tries to register the username will be aborted due to violating the constraint).
在一个每个用户都有唯一用户名的网站上,两个用户可能会同时尝试创建具有相同用户名的账户。你可以使用一个事务来检查一个名字是否已被占用,如果未被占用,则创建一个具有该名字的账户。然而,像前面的例子一样,这在快照隔离下是不安全的。幸运的是,唯一约束在这里是一个简单的解决方案(尝试注册用户名的第二个事务将因违反约束而被中止)。

Preventing double-spending
防止双重花费

A service that allows users to spend money or points needs to check that a user doesn’t spend more than they have. You might implement this by inserting a tentative spending item into a user’s account, listing all the items in the account, and checking that the sum is positive. With write skew, it could happen that two spending items are inserted concurrently that together cause the balance to go negative, but that neither transaction notices the other.
一个允许用户花费金钱或积分的服务需要检查用户是否花费超过其拥有量。你可以通过将一个暂定的花费项目插入用户的账户,列出账户中的所有项目,并检查总和是否为正来实现这一点。在有写入倾斜的情况下,可能会发生两个花费项目同时插入的情况,这两个项目共同导致余额变为负数,但这两个交易都没有注意到对方。

Phantoms causing write skew 导致写入倾斜的幻影#

All of these examples follow a similar pattern:
所有这些例子都遵循一个相似的模式:

  1. A SELECT query checks whether some requirement is satisfied by searching for rows that match some search condition (there are at least two doctors on call, there are no existing bookings for that room at that time, the position on the board doesn’t already have another figure on it, the username isn’t already taken, there is still money in the account).
    一个 SELECT 查询通过搜索符合某些搜索条件的行来检查是否满足某个要求(至少有两名医生值班、该房间在那时没有现有预订、棋盘上的位置上没有其他棋子、用户名未被占用、账户中仍有余额)。
  2. Depending on the result of the first query, the application code decides how to continue (perhaps to go ahead with the operation, or perhaps to report an error to the user and abort).
    根据第一个查询的结果,应用程序代码决定如何继续执行(或许继续进行操作,或许向用户报告错误并中止)。
  3. If the application decides to go ahead, it makes a write (INSERT, UPDATE, or DELETE) to the database and commits the transaction.
    如果应用程序决定继续执行,它会向数据库发送写入操作( INSERTUPDATEDELETE ),并提交事务。
    The effect of this write changes the precondition of the decision of step 2. In other words, if you were to repeat the SELECT query from step 1 after committing the write, you would get a different result, because the write changed the set of rows matching the search condition (there is now one fewer doctor on call, the meeting room is now booked for that time, the position on the board is now taken by the figure that was moved, the username is now taken, there is now less money in the account).
    这次写入操作的效果会改变步骤 2 决策的前置条件。换句话说,如果你在提交写入操作后重复执行步骤 1 中的 SELECT 查询,你会得到不同的结果,因为写入操作改变了符合搜索条件的行集合(现在值班医生少了一个,会议室已被预订,棋盘上的位置被移动的棋子占据,用户名已被占用,账户余额减少)。

The steps may occur in a different order. For example, you could first make the write, then the SELECT query, and finally decide whether to abort or commit based on the result of the query.
步骤的顺序可能会有所不同。例如,你可以先执行写入操作,然后执行 SELECT 查询,最后根据查询结果决定是否中止或提交。

In the case of the doctor on call example, the row being modified in step 3 was one of the rows returned in step 1, so we could make the transaction safe and avoid write skew by locking the rows in step 1 (SELECT FOR UPDATE). However, the other four examples are different: they check for the absence of rows matching some search condition, and the write adds a row matching the same condition. If the query in step 1 doesn’t return any rows, SELECT FOR UPDATE can’t attach locks to anything [57].
在值班医生示例中,步骤 3 中修改的行是步骤 1 返回的行之一,因此我们可以通过在步骤 1 锁定行( SELECT FOR UPDATE )来使事务安全并避免写偏斜。然而,其他四个示例不同:它们检查不满足某些搜索条件的行,而写入操作添加满足相同条件的行。如果步骤 1 的查询没有返回任何行, SELECT FOR UPDATE 无法将锁附加到任何对象上 [57]。

This effect, where a write in one transaction changes the result of a search query in another transaction, is called a phantom [4]. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL generated by ORMs is also prone to write skew [51,52].
当一个事务中的写入操作改变了另一个事务中的查询结果时,这种现象被称为幽灵 [4]。快照隔离可以避免只读查询中的幽灵,但在我们讨论的读写事务中,幽灵可能导致特别棘手的写偏斜问题。ORM 生成的 SQL 也容易导致写偏斜 [51,52]。

Materializing conflicts 显式化冲突#

If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we can artificially introduce a lock object into the database?
如果幽灵问题的症结在于我们无法将锁附加到某个对象上,那么我们是否可以人为地在数据库中引入一个锁对象呢?

For example, in the meeting room booking case you could imagine creating a table of time slots and rooms. Each row in this table corresponds to a particular room for a particular time period (say, 15 minutes). You create rows for all possible combinations of rooms and time periods ahead of time, e.g. for the next six months.
例如,在会议室预订的案例中,你可以想象创建一个包含时间段和房间的表格。该表格的每一行对应于特定时间段内的特定房间(比如 15 分钟)。你提前为所有可能的房间和时间段的组合创建行,比如未来六个月。

Now a transaction that wants to create a booking can lock (SELECT FOR UPDATE) the rows in the table that correspond to the desired room and time period. After it has acquired the locks, it can check for overlapping bookings and insert a new booking as before. Note that the additional table isn’t used to store information about the booking—it’s purely a collection of locks which is used to prevent bookings on the same room and time range from being modified concurrently.
现在,一个想要创建预订的事务可以锁定( SELECT FOR UPDATE )对应所需房间和时间段的表格行。在获取锁之后,它可以检查是否存在重叠的预订,并像之前一样插入新的预订。请注意,额外的表格不是用来存储预订信息的 —— 它纯粹是一个锁的集合,用于防止同一房间和时间范围内的预订被并发修改。

This approach is called materializing conflicts, because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database [14]. Unfortunately, it can be hard and error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control mechanism leak into the application data model. For those reasons, materializing conflicts should be considered a last resort if no alternative is possible. A serializable isolation level is much preferable in most cases.
这种方法称为显式化冲突,因为它将一个幻像转化为对数据库中实际存在的具体行集的锁冲突 [14]。不幸的是,显式化冲突很难且容易出错,而且让并发控制机制泄漏到应用数据模型中很不好看。出于这些原因,显式化冲突应该只在没有其他选择时作为最后手段考虑。在大多数情况下,可串行化隔离级别更可取。

Serializability 可串行化#

In this chapter we have seen several examples of transactions that are prone to race conditions. Some race conditions are prevented by the read committed and snapshot isolation levels, but others are not. We encountered some particularly tricky examples with write skew and phantoms. It’s a sad situation:
本章我们看到多个容易产生竞态条件的交易示例。有些竞态条件被读已提交和快照隔离级别所防止,但另一些则没有。我们遇到了一些特别棘手的写偏斜和幻像示例。情况令人沮丧:

  • Isolation levels are hard to understand, and inconsistently implemented in different databases (e.g., the meaning of “repeatable read” varies significantly).
    隔离级别很难理解,并且在不同的数据库中实现不一致(例如,“可重复读” 的含义差异很大)。
  • If you look at your application code, it’s difficult to tell whether it is safe to run at a particular isolation level—especially in a large application, where you might not be aware of all the things that may be happening concurrently.
    如果你查看你的应用代码,很难判断在特定的隔离级别下运行是否安全 —— 尤其是在大型应用中,你可能并不清楚所有可能发生的并发情况。
  • There are no good tools to help us detect race conditions. In principle, static analysis may help [34], but research techniques have not yet found their way into practical use. Testing for concurrency issues is hard, because they are usually nondeterministic—problems only occur if you get unlucky with the timing.
    目前没有好的工具能帮助我们检测竞态条件。原则上,静态分析或许能提供帮助 [34],但研究技术尚未应用于实际。检测并发问题很困难,因为它们通常是非确定性的 —— 只有在时间上运气不好时才会出现问题。

This is not a new problem—it has been like this since the 1970s, when weak isolation levels were first introduced [3]. All along, the answer from researchers has been simple: use serializable isolation!
这不是一个新问题 —— 自从 1970 年代首次引入弱隔离级别以来就一直如此 [3]。研究人员一直给出的简单答案是:使用可序列化隔离!

Serializable isolation is the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency. Thus, the database guarantees that if the transactions behave correctly when run individually, they continue to be correct when run concurrently—in other words, the database prevents all possible race conditions.
可序列化隔离是最高级别的隔离。它保证即使事务可能并行执行,最终结果也和它们逐个串行执行时一样,没有任何并发。因此,数据库保证如果事务单独运行时行为正确,那么并发运行时也继续正确 —— 换句话说,数据库防止了所有可能的竞态条件。

But if serializable isolation is so much better than the mess of weak isolation levels, then why isn’t everyone using it? To answer this question, we need to look at the options for implementing serializability, and how they perform. Most databases that provide serializability today use one of three techniques, which we will explore in the rest of this chapter:
但如果可串行化隔离比弱隔离级别的混乱好得多,那为什么不是每个人都用它呢?要回答这个问题,我们需要看看实现可串行化的选项,以及它们的性能。如今大多数提供可串行化功能的数据库使用以下三种技术中的一种,我们将在本章剩余部分探讨这些技术:

  • Literally executing transactions in a serial order (see “Actual Serial Execution”)
    实际按串行顺序执行交易(参见 “实际串行执行”)
  • Two-phase locking (see “Two-Phase Locking (2PL)”), which for several decades was the only viable option
    两阶段锁(参见 “两阶段锁(2PL)”,几十年来一直是唯一可行的选择)
  • Optimistic concurrency control techniques such as serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)”)
    乐观并发控制技术,如可串行化快照隔离(参见 “可串行化快照隔离(SSI)”)

Actual Serial Execution 实际串行执行#

The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely sidestep the problem of detecting and preventing conflicts between transactions: the resulting isolation is by definition serializable.
避免并发问题的最简单方法是完全消除并发:一次只执行一个交易,按串行顺序,在单个线程上执行。通过这样做,我们完全回避了检测和防止交易之间冲突的问题:由此产生的隔离在定义上是可以串行化的。

Even though this seems like an obvious idea, it was only in the 2000s that database designers decided that a single-threaded loop for executing transactions was feasible [58]. If multi-threaded concurrency was considered essential for getting good performance during the previous 30 years, what changed to make single-threaded execution possible?
尽管这看起来是一个显而易见的想法,但数据库设计者直到 21 世纪才决定单线程循环执行事务是可行的 [58]。如果在前 30 年里,多线程并发被认为是获得良好性能的必要条件,那么是什么改变了,使得单线程执行成为可能?

Two developments caused this rethink:
导致这一重新思考的有两个发展:

  • RAM became cheap enough that for many use cases it is now feasible to keep the entire active dataset in memory (see “Keeping everything in memory”). When all data that a transaction needs to access is in memory, transactions can execute much faster than if they have to wait for data to be loaded from disk.
    RAM 变得足够便宜,以至于对于许多用例来说,现在将整个活动数据集保存在内存中是可行的(参见 “将所有内容保存在内存中”)。当交易所需访问的所有数据都在内存中时,交易的执行速度会比需要等待从磁盘加载数据时快得多。
  • Database designers realized that OLTP transactions are usually short and only make a small number of reads and writes (see “Analytical versus Operational Systems”). By contrast, long-running analytic queries are typically read-only, so they can be run on a consistent snapshot (using snapshot isolation) outside of the serial execution loop.
    数据库设计者意识到 OLTP 事务通常较短,并且只进行少量读和写(参见 “分析型系统与操作型系统”)。相比之下,长时间运行的分析型查询通常是只读的,因此它们可以在串行执行循环之外,在一致性快照(使用快照隔离)上运行。

The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic, for example [59,60,61]. A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking. However, its throughput is limited to that of a single CPU core. In order to make the most of that single thread, transactions need to be structured differently from their traditional form.
串行执行事务的方法在 VoltDB/H-Store、Redis 和 Datomic 等系统中得到实现 [59, 60, 61]。为单线程执行设计的系统有时比支持并发的系统表现更好,因为它可以避免锁的协调开销。然而,其吞吐量受限于单个 CPU 核心的能力。为了充分利用这个单线程,事务需要与传统形式不同地构建。

Encapsulating transactions in stored procedures 将事务封装在存储过程中#

In the early days of databases, the intention was that a database transaction could encompass an entire flow of user activity. For example, booking an airline ticket is a multi-stage process (searching for routes, fares, and available seats; deciding on an itinerary; booking seats on each of the flights of the itinerary; entering passenger details; making payment). Database designers thought that it would be neat if that entire process was one transaction so that it could be committed atomically.
在数据库的早期,设想是数据库事务可以涵盖整个用户活动流程。例如,预订机票是一个多阶段的过程(搜索路线、票价和可用座位;决定行程;预订行程中每趟航班的位置;输入乘客信息;付款)。数据库设计者认为,如果整个流程是一个事务,那么它可以原子性地提交,这样会很整洁。

Unfortunately, humans are very slow to make up their minds and respond. If a database transaction needs to wait for input from a user, the database needs to support a potentially huge number of concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost all OLTP applications keep transactions short by avoiding interactively waiting for a user within a transaction. On the web, this means that a transaction is committed within the same HTTP request—a transaction does not span multiple requests. A new HTTP request starts a new transaction.
不幸的是,人类做决定和响应的速度非常慢。如果数据库事务需要等待用户输入,数据库就需要支持潜在的大量并发事务,其中大部分处于空闲状态。大多数数据库无法高效地做到这一点,因此几乎所有 OLTP 应用程序都通过避免在事务内交互式等待用户来保持事务简短。在 Web 上,这意味着事务在同一个 HTTP 请求内提交 —— 事务不会跨越多个请求。一个新的 HTTP 请求会启动一个新的事务。

Even though the human has been taken out of the critical path, transactions have continued to be executed in an interactive client/server style, one statement at a time. An application makes a query, reads the result, perhaps makes another query depending on the result of the first query, and so on. The queries and results are sent back and forth between the application code (running on one machine) and the database server (on another machine).
尽管人已经从关键路径中移除,但事务仍然以交互式客户端 / 服务器风格继续执行,一次一条语句。应用程序发起查询,读取结果,可能根据第一个查询的结果发起另一个查询,如此反复。查询和结果在应用程序代码(运行在一台机器上)与数据库服务器(运行在另一台机器上)之间来回传递。

In this interactive style of transaction, a lot of time is spent in network communication between the application and the database. If you were to disallow concurrency in the database and only process one transaction at a time, the throughput would be dreadful because the database would spend most of its time waiting for the application to issue the next query for the current transaction. In this kind of database, it’s necessary to process multiple transactions concurrently in order to get reasonable performance.
在这种交互式事务中,应用与数据库之间的网络通信会花费大量时间。如果你禁止数据库并发处理,只处理一个事务,那么吞吐量会非常糟糕,因为数据库大部分时间都会在等待应用发出当前事务的下一个查询。在这种数据库中,为了获得合理的性能,必须并发处理多个事务。

For this reason, systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions. Instead, the application must either limit itself to transactions containing a single statement, or submit the entire transaction code to the database ahead of time, as a stored procedure [62].
因此,采用单线程串行事务处理的系统不允许交互式多语句事务。相反,应用程序必须限制自身只包含单语句的事务,或者提前将整个事务代码作为存储过程提交给数据库 [62]。

The differences between interactive transactions and stored procedures is illustrated in Figure 8-9. Provided that all data required by a transaction is in memory, the stored procedure can execute very quickly, without waiting for any network or disk I/O.
交互式事务与存储过程之间的区别如图 8-9 所示。如果事务所需的所有数据都在内存中,存储过程可以非常快速地执行,无需等待任何网络或磁盘 I / O。

ddia 0809

Pros and cons of stored procedures 存储过程的优缺点#

Stored procedures have existed for some time in relational databases, and they have been part of the SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons:
存储过程在关系型数据库中已经存在一段时间了,自 1999 年以来一直是 SQL 标准(SQL / PSM)的一部分。由于各种原因,它们获得了一定的坏名声:

  • Traditionally, each database vendor had its own language for stored procedures (Oracle has PL/SQL, SQL Server has T-SQL, PostgreSQL has PL/pgSQL, etc.). These languages haven’t kept up with developments in general-purpose programming languages, so they look quite ugly and archaic from today’s point of view, and they lack the ecosystem of libraries that you find with most programming languages.
    传统上,每个数据库供应商都有自己的存储过程语言(Oracle 有 PL/SQL,SQL Server 有 T - SQL,PostgreSQL 有 PL/pgSQL 等)。这些语言没有跟上通用编程语言的发展,从今天的角度来看,它们显得相当丑陋和过时,并且缺乏大多数编程语言所拥有的库生态系统。
  • Code running in a database is difficult to manage: compared to an application server, it’s harder to debug, more awkward to keep in version control and deploy, trickier to test, and difficult to integrate with a metrics collection system for monitoring.
    在数据库中运行的代码难以管理:与应用服务器相比,它更难调试,更难进行版本控制和部署,更难测试,并且难以与用于监控的指标收集系统集成。
  • A database is often much more performance-sensitive than an application server, because a single database instance is often shared by many application servers. A badly written stored procedure (e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent badly written code in an application server.
    数据库通常比应用服务器对性能更敏感,因为单个数据库实例通常由许多应用服务器共享。数据库中编写的糟糕的存储过程(例如,使用大量内存或 CPU 时间)可能比应用服务器中同等糟糕的代码造成更多麻烦。
  • In a multitenant system that allows tenants to write their own stored procedures, it’s a security risk to execute untrusted code in the same process as the database kernel [63].
    在一个允许多租户编写自己的存储过程的多租户系统中,在数据库内核进程执行不受信任的代码是一种安全风险 [63]。

However, those issues can be overcome. Modern implementations of stored procedures have abandoned PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy, Datomic uses Java or Clojure, Redis uses Lua, and MongoDB uses Javascript.
然而,这些问题是可以克服的。现代存储过程实现已经放弃了 PL / SQL,转而使用现有的通用编程语言:VoltDB 使用 Java 或 Groovy,Datomic 使用 Java 或 Clojure,Redis 使用 Lua,MongoDB 使用 Javascript。

Stored procedures are also useful in cases where application logic can’t easily be embedded elsewhere. Applications that use GraphQL, for example, might directly expose their database through a GraphQL proxy. If the proxy doesn’t support complex validation logic, you can embed such logic directly in the database using a stored procedure. If the database doesn’t support stored procedures, you would have to deploy a validation service between the proxy and the database to do validation.
存储过程在应用逻辑难以嵌入其他地方的情况下也非常有用。例如,使用 GraphQL 的应用程序可能会通过 GraphQL 代理直接暴露其数据库。如果代理不支持复杂的验证逻辑,你可以使用存储过程直接在数据库中嵌入这种逻辑。如果数据库不支持存储过程,你将不得不在代理和数据库之间部署一个验证服务来执行验证。

With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. When stored procedures don’t need to wait for I/O and avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread.
有了存储过程和内存中的数据,在单个线程上执行所有事务变得可行。当存储过程不需要等待 I / O 并避免其他并发控制机制的开销时,它们可以在单个线程上实现相当好的吞吐量。

VoltDB also uses stored procedures for replication: instead of copying a transaction’s writes from one node to another, it executes the same stored procedure on each replica. VoltDB therefore requires that stored procedures are deterministic (when run on different nodes, they must produce the same result). If a transaction needs to use the current date and time, for example, it must do so through special deterministic APIs (see “Durable Execution and Workflows” for more details on deterministic operations). This approach is called state machine replication, and we will return to it in Chapter 10.
VoltDB 也使用存储过程进行复制:它不是将一个事务的写入从一台节点复制到另一台节点,而是在每个副本上执行相同的存储过程。因此,VoltDB 要求存储过程必须是确定性的(在不同的节点上运行时,必须产生相同的结果)。例如,如果一个事务需要使用当前日期和时间,它必须通过特殊的确定性 API 来实现(有关确定性操作的更多详细信息,请参阅 “持久执行和工作流”)。这种方法称为状态机复制,我们将在第 10 章中再次讨论。

Sharding 分片#

Executing all transactions serially makes concurrency control much simpler, but limits the transaction throughput of the database to the speed of a single CPU core on a single machine. Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with high write throughput, the single-threaded transaction processor can become a serious bottleneck.
将所有事务串行执行可以大大简化并发控制,但会将数据库的事务吞吐量限制在单台机器上单个 CPU 核心的速度。只读事务可以在其他地方执行,使用快照隔离,但对于具有高写入吞吐量的应用,单线程事务处理器可能会成为一个严重的瓶颈。

In order to scale to multiple CPU cores, and multiple nodes, you can shard your data (see Chapter 7), which is supported in VoltDB. If you can find a way of sharding your dataset so that each transaction only needs to read and write data within a single shard, then each shard can have its own transaction processing thread running independently from the others. In this case, you can give each CPU core its own shard, which allows your transaction throughput to scale linearly with the number of CPU cores [60].
为了扩展到多个 CPU 核心和多个节点,你可以对你的数据进行分片(见第 7 章),这在 VoltDB 中得到支持。如果你能找到一种分片方法,使得每个事务只需要读取和写入单个分片内的数据,那么每个分片都可以有自己的事务处理线程,独立于其他线程运行。在这种情况下,你可以给每个 CPU 核心分配自己的分片,这使得你的事务吞吐量可以随着 CPU 核心数量的增加而线性扩展 [60]。

However, for any transaction that needs to access multiple shards, the database must coordinate the transaction across all the shards that it touches. The stored procedure needs to be performed in lock-step across all shards to ensure serializability across the whole system.
然而,对于需要访问多个分片的任何事务,数据库必须协调该事务跨越所有它所触及的分片。存储过程需要在所有分片上同步执行,以确保整个系统中的串行化。

Since cross-shard transactions have additional coordination overhead, they are vastly slower than single-shard transactions. VoltDB reports a throughput of about 1,000 cross-shard writes per second, which is orders of magnitude below its single-shard throughput and cannot be increased by adding more machines [62]. More recent research has explored ways of making multi-shard transactions more scalable [64].
由于跨分片事务具有额外的协调开销,它们的速度远远慢于单分片事务。VoltDB 报告每秒约 1000 次跨分片写入的吞吐量,这比其单分片吞吐量低几个数量级,并且无法通过增加更多机器来提高 [62]。最近的研究探索了使多分片事务更具可扩展性的方法 [64]。

Whether transactions can be single-shard depends very much on the structure of the data used by the application. Simple key-value data can often be sharded very easily, but data with multiple secondary indexes is likely to require a lot of cross-shard coordination (see “Sharding and Secondary Indexes”).
事务是否可以单分片很大程度上取决于应用程序使用的数据结构。简单的键值数据通常可以非常容易地进行分片,但具有多个次要索引的数据很可能需要大量的跨分片协调(参见 “分片与次要索引”)。

Summary of serial execution 串行执行总结#

Serial execution of transactions has become a viable way of achieving serializable isolation within certain constraints:
串行执行事务在某些约束条件下已成为实现可串行化隔离的一种可行方式:

  • Every transaction must be small and fast, because it takes only one slow transaction to stall all transaction processing.
    每个事务都必须小而快,因为只需要一个慢事务就能使所有事务处理停滞。
  • It is most appropriate in situations where the active dataset can fit in memory. Rarely accessed data could potentially be moved to disk, but if it needed to be accessed in a single-threaded transaction, the system would get very slow.
    在活跃数据集可以适合内存的情况下最为适用。很少访问的数据可以潜在地移到磁盘,但如果需要在单线程事务中访问,系统会变得非常慢。
  • Write throughput must be low enough to be handled on a single CPU core, or else transactions need to be sharded without requiring cross-shard coordination.
    写入吞吐量必须足够低,以便能在单个 CPU 核心上处理,否则需要将事务分片而不需要跨分片协调。
  • Cross-shard transactions are possible, but their throughput is hard to scale.
    跨分片事务是可能的,但其吞吐量难以扩展。

Two-Phase Locking (2PL) 两阶段锁(2PL)#

For around 30 years, there was only one widely used algorithm for serializability in databases:two-phase locking (2PL), sometimes called strong strict two-phase locking (SS2PL) to distinguish it from other variants of 2PL.
大约有 30 年的时间,数据库中广泛使用的可串行化算法只有一种:两阶段锁(2PL),有时也称为强严格两阶段锁(SS2PL),以区别于其他 2PL 变体。

2PL is not 2PC2PL 不是 2PC#

Two-phase locking (2PL) and two-phase commit (2PC) are two very different things. 2PL provides serializable isolation, whereas 2PC provides atomic commit in a distributed database (see “Two-Phase Commit (2PC)”). To avoid confusion, it’s best to think of them as entirely separate concepts and to ignore the unfortunate similarity in the names.
两阶段锁(2PL)和两阶段提交(2PC)是两种截然不同的概念。2PL 提供可串行化隔离,而 2PC 提供分布式数据库中的原子提交(参见 “两阶段提交(2PC)”)。为了避免混淆,最好将它们视为完全独立的概念,并忽略它们名称的不幸相似之处。

We saw previously that locks are often used to prevent dirty writes (see “No dirty writes”): if two transactions concurrently try to write to the same object, the lock ensures that the second writer must wait until the first one has finished its transaction (aborted or committed) before it may continue.
我们之前看到锁通常用于防止脏写(参见 “禁止脏写”):如果两个事务同时尝试写入同一个对象,锁确保第二个写入者必须等待第一个事务完成(中止或提交)后才能继续。

Two-phase locking is similar, but makes the lock requirements much stronger. Several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required:
两阶段锁机制类似,但锁的要求更强。只要没有人写入该对象,多个事务可以并发地读取同一个对象。但一旦有人想要写入(修改或删除)一个对象,就需要独占访问权限:

  • If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts before it can continue. (This ensures that B can’t change the object unexpectedly behind A’s back.)
    如果事务 A 已经读取了一个对象,而事务 B 想要写入该对象,B 必须等待 A 提交或中止后才能继续。(这确保了 B 不会在 A 背后意外地改变对象。)
  • If transaction A has written an object and transaction B wants to read that object, B must wait until A commits or aborts before it can continue. (Reading an old version of the object, like in Figure 8-4, is not acceptable under 2PL.)
    如果事务 A 已经写入了一个对象,而事务 B 想要读取该对象,B 必须等待 A 提交或中止后才能继续。(在 2PL 下,读取对象旧版本的情况,如图 8-4 所示,是不被允许的。)

In 2PL, writers don’t just block other writers; they also block readers and vice versa. Snapshot isolation has the mantra readers never block writers, and writers never block readers (see “Multi-version concurrency control (MVCC)”), which captures this key difference between snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability, it protects against all the race conditions discussed earlier, including lost updates and write skew.
在 2PL 中,写者不仅会阻塞其他写者,还会阻塞读者,反之亦然。快照隔离有句口号:读者从不阻塞写者,写者从不阻塞读者(参见 “多版本并发控制(MVCC)”),这捕捉了快照隔离和两阶段锁的关键区别。另一方面,由于 2PL 提供可序列化性,它保护了前面讨论的所有竞态条件,包括丢失更新和写者偏差。

Implementation of two-phase locking 两阶段锁的实现#

2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the repeatable read isolation level in Db2 [30].
可序列化隔离级别在 MySQL(InnoDB)和 SQL Server 中,以及 Db2 的重复读隔离级别使用 2PL [30]。

The blocking of readers and writers is implemented by having a lock on each object in the database. The lock can either be in shared mode or in exclusive mode (also known as a multi-reader single-writer lock). The lock is used as follows:
读者和写者的阻塞是通过在数据库中的每个对象上使用锁来实现的。该锁可以是共享模式或排他模式(也称为多读者单写者锁)。锁的使用方式如下:

  • If a transaction wants to read an object, it must first acquire the lock in shared mode. Several transactions are allowed to hold the lock in shared mode simultaneously, but if another transaction already has an exclusive lock on the object, these transactions must wait.
    如果一个事务想要读取一个对象,它必须首先以共享模式获取锁。允许多个事务同时以共享模式持有锁,但如果另一个事务已经对该对象持有排他锁,这些事务必须等待。
  • If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No other transaction may hold the lock at the same time (either in shared or in exclusive mode), so if there is any existing lock on the object, the transaction must wait.
    如果一个事务想要写入一个对象,它必须首先以排他模式获取锁。其他事务不能同时持有该锁(无论是共享模式还是排他模式),所以如果对象上存在任何锁,该事务必须等待。
  • If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock. The upgrade works the same as getting an exclusive lock directly.
    如果一个事务先读取再写入一个对象,它可以将共享锁升级为排他锁。锁的升级与直接获取排他锁相同。
  • After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction (commit or abort). This is where the name “two-phase” comes from: the first phase (while the transaction is executing) is when the locks are acquired, and the second phase (at the end of the transaction) is when all the locks are released.
    一个事务获取锁后,必须一直持有该锁直到事务结束(提交或中止)。这就是 “两阶段” 这个名称的来源:第一阶段(事务执行期间)是获取锁,第二阶段(事务结束时)是释放所有锁。

Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for transaction B to release its lock, and vice versa. This situation is called deadlock. The database automatically detects deadlocks between transactions and aborts one of them so that the others can make progress. The aborted transaction needs to be retried by the application.
由于使用了大量锁,很容易出现事务 A 等待事务 B 释放锁,而事务 B 又等待事务 A 释放锁的情况。这种情况称为死锁。数据库会自动检测事务之间的死锁,并中止其中一个事务,以便其他事务可以继续进行。被中止的事务需要由应用程序重新尝试。

Performance of two-phase locking 两阶段锁的性能#

The big downside of two-phase locking, and the reason why it hasn’t been the default for most systems since the 1970s, is performance: transaction throughput and response times of queries are significantly worse under two-phase locking than under weak isolation.
两阶段锁的缺点很大,这也是自 20 世纪 70 年代以来它没有成为大多数系统的默认选项的原因:性能:在两阶段锁下,交易的吞吐量和查询的响应时间比在弱隔离下要差得多。

This is partly due to the overhead of acquiring and releasing all those locks, but more importantly due to reduced concurrency. By design, if two concurrent transactions try to do anything that may in any way result in a race condition, one has to wait for the other to complete.
这部分是由于获取和释放所有这些锁的开销,但更重要的是并发性降低。按设计,如果两个并发交易尝试做任何可能导致竞争条件的事情,其中一个必须等待另一个完成。

For example, if you have a transaction that needs to read an entire table (e.g. a backup, analytics query, or integrity check, as discussed in “Snapshot Isolation and Repeatable Read”), that transaction has to take a shared lock on the entire table. Therefore, the reading transaction first has to wait until all in-progress transactions writing to that table have completed; then, while the whole table is being read (which may take a long time on a large table), all other transactions that want to write to that table are blocked until the big read-only transaction commits. In effect, the database becomes unavailable for writes for an extended time.
例如,如果你有一个需要读取整个表的交易(例如备份、分析查询或完整性检查,如 “快照隔离和可重复读” 中所述),该交易必须对整个表获取共享锁。因此,读取交易首先必须等待所有正在写入该表的交易完成;然后,在读取整个表期间(对于大表来说这可能需要很长时间),所有其他想要写入该表的交易都会被阻塞,直到大的只读交易提交。实际上,数据库在一段时间内无法进行写入。

For this reason, databases running 2PL can have quite unstable latencies, and they can be very slow at high percentiles (see “Describing Performance”) if there is contention in the workload. It may take just one slow transaction, or one transaction that accesses a lot of data and acquires many locks, to cause the rest of the system to grind to a halt.
正因如此,运行 2PL 的数据库可能具有相当不稳定的延迟,如果在工作负载中存在争用,它们在高百分位数时可能会非常慢(参见 “描述性能”)。可能只需要一个慢速事务,或者一个访问大量数据并获取许多锁的事务,就能导致整个系统陷入停滞。

Although deadlocks can happen with the lock-based read committed isolation level, they occur much more frequently under 2PL serializable isolation (depending on the access patterns of your transaction). This can be an additional performance problem: when a transaction is aborted due to deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can mean significant wasted effort.
尽管在基于锁的读已提交隔离级别下可能会发生死锁,但在 2PL 可序列化隔离级别下发生的频率要高得多(这取决于事务的访问模式)。这可能是一个额外的性能问题:当事务因死锁而中止并重试时,它需要重新做所有工作。如果死锁频繁发生,这可能意味着大量的浪费努力。

Predicate locks 谓词锁#

In the preceding description of locks, we glossed over a subtle but important detail. In “Phantoms causing write skew” we discussed the problem of phantoms —that is, one transaction changing the results of another transaction’s search query. A database with serializable isolation must prevent phantoms.
在前面对锁的描述中,我们忽略了一个微妙但重要细节。在 “幻影导致写偏差” 中,我们讨论了幻影的问题,即一个事务改变了另一个事务搜索查询的结果。具有可序列化隔离级别的数据库必须防止幻影。

In the meeting room booking example this means that if one transaction has searched for existing bookings for a room within a certain time window (see Example 8-2), another transaction is not allowed to concurrently insert or update another booking for the same room and time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a different time that doesn’t affect the proposed booking.)
在会议室预订示例中,这意味着如果一个事务在特定时间窗口内搜索了现有预订(参见示例 8-2),另一个事务就不允许同时插入或更新同一房间和时间段内的另一个预订。(同时插入其他房间的预订,或同一房间但在不影响提议预订的不同时间段的预订是可以的。)

How do we implement this? Conceptually, we need a predicate lock [4]. It works similarly to the shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one row in a table), it belongs to all objects that match some search condition, such as:
我们如何实现这一点?概念上,我们需要一个谓词锁 [4]。它的工作方式类似于前面描述的共享 / 独占锁,但它属于所有匹配某些搜索条件(例如:)的对象,而不是特定的对象(例如,表中的一行)。

A predicate lock restricts access as follows:
谓词锁限制访问的方式如下:

  • If transaction A wants to read objects matching some condition, like in that SELECT query, it must acquire a shared-mode predicate lock on the conditions of the query. If another transaction B currently has an exclusive lock on any object matching those conditions, A must wait until B releases its lock before it is allowed to make its query.
    如果事务 A 想要读取满足某些条件(如 SELECT 查询中的条件)的对象,它必须获取查询条件的共享模式谓词锁。如果另一个事务 B 当前对满足这些条件中的任何对象持有独占锁,A 必须等待 B 释放锁后才能执行其查询。
  • If transaction A wants to insert, update, or delete any object, it must first check whether either the old or the new value matches any existing predicate lock. If there is a matching predicate lock held by transaction B, then A must wait until B has committed or aborted before it can continue.
    如果事务 A 想要插入、更新或删除任何对象,它必须首先检查旧值或新值是否与任何现有的谓词锁匹配。如果事务 B 持有匹配的谓词锁,那么 A 必须等待 B 提交或中止后才能继续。

The key idea here is that a predicate lock applies even to objects that do not yet exist in the database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions, and so its isolation becomes serializable.
这里的要点是谓词锁适用于数据库中尚未存在但未来可能添加的对象(幽灵)。如果两阶段锁包括谓词锁,数据库将防止所有形式的写偏斜和其他竞争条件,因此其隔离级别变为可串行化。

Index-range locks 索引范围锁#

Unfortunately, predicate locks do not perform well: if there are many locks by active transactions, checking for matching locks becomes time-consuming. For that reason, most databases with 2PL actually implement index-range locking (also known as next-key locking), which is a simplified approximation of predicate locking [55,65].
不幸的是,谓词锁性能不佳:如果有许多由活动事务产生的锁,检查匹配的锁会变得耗时。因此,大多数使用两阶段锁(2PL)的数据库实际上实现了索引范围锁(也称为下一键锁),这是一种谓词锁的简化近似 [55, 65]。

It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just room 123) between noon and 1 p.m. This is safe because any write that matches the original predicate will definitely also match the approximations.
通过使其匹配更大范围的对象来简化谓词是安全的。例如,如果你有一个谓词锁,用于锁定 123 房间中午到下午 1 点之间的预订,你可以通过锁定 123 房间任何时间的预订来近似它,或者通过锁定中午到下午 1 点之间所有房间(而不仅仅是 123 房间)来近似它。这是安全的,因为任何匹配原始谓词的写入操作也一定会匹配这些近似。

In the room bookings database you would probably have an index on the room_id column, and/or indexes on start_time and end_time (otherwise the preceding query would be very slow on a large database):
在房间预订数据库中,你可能会在 room_id 列上创建一个索引,和 / 或在 start_timeend_time 上创建索引(否则前面的查询在大型数据库上会非常慢):

  • Say your index is on room_id, and the database uses this index to find existing bookings for room 123. Now the database can simply attach a shared lock to this index entry, indicating that a transaction has searched for bookings of room 123.
    假设你的索引是在 room_id 上,数据库使用这个索引来查找房间 123 的现有预订。现在数据库可以简单地在这个索引条目上附加一个共享锁,表明一个事务已经搜索过房间 123 的预订。
  • Alternatively, if the database uses a time-based index to find existing bookings, it can attach a shared lock to a range of values in that index, indicating that a transaction has searched for bookings that overlap with the time period of noon to 1 p.m. on January 1, 2025.
    或者,如果数据库使用基于时间的索引来查找现有预订,它可以附加一个共享锁到该索引中某个值范围上,表示一个事务已经搜索过与 2025 年 1 月 1 日中午 12 点到下午 1 点时间段重叠的预订。

Either way, an approximation of the search condition is attached to one of the indexes. Now, if another transaction wants to insert, update, or delete a booking for the same room and/or an overlapping time period, it will have to update the same part of the index. In the process of doing so, it will encounter the shared lock, and it will be forced to wait until the lock is released.
无论哪种方式,搜索条件的近似值都会附加到其中一个索引上。现在,如果另一个事务想要插入、更新或删除同一房间和 / 或重叠时间段内的预订,它将不得不更新索引的同一部分。在这个过程中,它会遇到共享锁,并且被迫等待直到锁被释放。

This provides effective protection against phantoms and write skew. Index-range locks are not as precise as predicate locks would be (they may lock a bigger range of objects than is strictly necessary to maintain serializability), but since they have much lower overheads, they are a good compromise.
这提供了有效的保护,防止出现幻影和写入偏差。索引范围锁不如谓词锁精确(它们可能会锁定比确保可串行化所必需的更广的对象范围),但由于它们的开销要低得多,因此它们是一个很好的折中方案。

If there is no suitable index where a range lock can be attached, the database can fall back to a shared lock on the entire table. This will not be good for performance, since it will stop all other transactions writing to the table, but it’s a safe fallback position.
如果没有合适的索引可以附加范围锁,数据库可以回退到对整个表使用共享锁。这对性能不利,因为它会阻止所有其他事务向表中写入,但它是一个安全的回退方案。

Serializable Snapshot Isolation (SSI) 可序列化快照隔离 (SSI)#

This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we have implementations of serializability that don’t perform well (two-phase locking) or don’t scale well (serial execution). On the other hand, we have weak isolation levels that have good performance, but are prone to various race conditions (lost updates, write skew, phantoms, etc.). Are serializable isolation and good performance fundamentally at odds with each other?
本章描绘了数据库并发控制黯淡的图景。一方面,我们有性能不佳的可序列化实现(两阶段锁)或扩展性不佳的实现(串行执行)。另一方面,我们有性能良好但易受各种竞态条件(丢失更新、写偏差、幻影等)影响的弱隔离级别。可序列化隔离和良好性能从根本上相互矛盾吗?

It seems not: an algorithm called serializable snapshot isolation (SSI) provides full serializability with only a small performance penalty compared to snapshot isolation. SSI is comparatively new: it was first described in 2008 [54,66].
似乎并非如此:一种名为可序列化快照隔离(SSI)的算法提供了与快照隔离相比仅有微小性能损失的完全可序列化。SSI 相对较新:它首次于 2008 年被描述 [54, 66]。

Today SSI and similar algorithms are used in single-node databases (the serializable isolation level in PostgreSQL [55], SQL Server’s In-Memory OLTP/Hekaton [67], and HyPer [68]), distributed databases (CockroachDB [5] and FoundationDB [8]), and embedded storage engines such as BadgerDB.
如今,SSI 和类似算法被用于单节点数据库(PostgreSQL 中的可序列化隔离级别 [55]、SQL Server 的 In-Memory OLTP / Hekaton [67] 和 HyPer [68])、分布式数据库(CockroachDB [5] 和 FoundationDB [8]),以及嵌入式存储引擎如 BadgerDB。

Pessimistic versus optimistic concurrency control 悲观与乐观并发控制#

Two-phase locking is a so-called pessimistic concurrency control mechanism: it is based on the principle that if anything might possibly go wrong (as indicated by a lock held by another transaction), it’s better to wait until the situation is safe again before doing anything. It is like mutual exclusion, which is used to protect data structures in multi-threaded programming.
两阶段锁是一种所谓的悲观并发控制机制:它基于这样的原则,如果可能发生任何问题(如另一个事务持有的锁所指示),最好在情况再次安全之前等待。这就像互斥,用于保护多线程编程中的数据结构。

Serial execution is, in a sense, pessimistic to the extreme: it is essentially equivalent to each transaction having an exclusive lock on the entire database (or one shard of the database) for the duration of the transaction. We compensate for the pessimism by making each transaction very fast to execute, so it only needs to hold the “lock” for a short time.
串行执行在某种程度上是极度悲观的:它本质上等同于每个事务在整个数据库(或数据库的一个分片)上持有排他锁,直到事务结束。我们通过使每个事务执行非常快来补偿这种悲观,因此它只需要 “持有锁” 很短的时间。

By contrast, serializable snapshot isolation is an optimistic concurrency control technique. Optimistic in this context means that instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened (i.e., whether isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions that executed serializably are allowed to commit.
相比之下,可串行化快照隔离是一种乐观并发控制技术。这里的乐观意味着,如果发生潜在的危险情况,事务不会阻塞,而是继续进行,希望一切都能顺利。当事务想要提交时,数据库会检查是否发生了什么坏事(即是否违反了隔离性);如果是这样,事务会被中止,并且必须重试。只有那些可串行化执行的事务才被允许提交。

Optimistic concurrency control is an old idea [69], and its advantages and disadvantages have been debated for a long time [70]. It performs badly if there is high contention (many transactions trying to access the same objects), as this leads to a high proportion of transactions needing to abort. If the system is already close to its maximum throughput, the additional transaction load from retried transactions can make performance worse.
乐观并发控制是一个旧概念 [69],其优缺点已经争论了很长时间 [70]。如果存在高竞争(许多事务尝试访问相同对象),其表现会很差,因为这会导致大量事务需要中止。如果系统已经接近最大吞吐量,重试事务带来的额外负载可能会使性能更差。

However, if there is enough spare capacity, and if contention between transactions is not too high, optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention can be reduced with commutative atomic operations: for example, if several transactions concurrently want to increment a counter, it doesn’t matter in which order the increments are applied (as long as the counter isn’t read in the same transaction), so the concurrent increments can all be applied without conflicting.
然而,如果备用容量充足,且事务间的竞争不高,乐观并发控制技术往往比悲观技术表现更好。通过交换原子操作可以减少竞争:例如,如果多个事务同时想要增加一个计数器,只要计数器不在同一事务中读取,增加的顺序无关紧要,因此这些并发增加可以全部应用而不冲突。

As the name suggests, SSI is based on snapshot isolation—that is, all reads within a transaction are made from a consistent snapshot of the database (see “Snapshot Isolation and Repeatable Read”). On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among reads and writes, and determining which transactions to abort.
正如其名称所示,SSI 基于快照隔离 —— 也就是说,事务中的所有读取操作都是从数据库的一致性快照中进行的(参见 “快照隔离与可重复读”)。在快照隔离的基础上,SSI 增加了一种算法,用于检测读写操作之间的序列化冲突,并确定哪些事务需要中止。

Decisions based on an outdated premise 基于过时前提的决策#

When we previously discussed write skew in snapshot isolation (see “Write Skew and Phantoms”), we observed a recurring pattern: a transaction reads some data from the database, examines the result of the query, and decides to take some action (write to the database) based on the result that it saw. However, under snapshot isolation, the result from the original query may no longer be up-to-date by the time the transaction commits, because the data may have been modified in the meantime.
当我们之前讨论快照隔离中的写倾斜(参见 “写倾斜与幻影”)时,我们观察到一种反复出现的模式:事务从数据库读取一些数据,检查查询结果,并根据所见结果决定采取某些行动(写入数据库)。然而,在快照隔离下,当事务提交时,原始查询的结果可能已经不再是最新的,因为数据可能在此期间已被修改。

Put another way, the transaction is taking an action based on a premise (a fact that was true at the beginning of the transaction, e.g., “There are currently two doctors on call”). Later, when the transaction wants to commit, the original data may have changed—the premise may no longer be true.
换句话说,事务基于一个前提条件执行操作(该条件在事务开始时为真,例如 “目前有两位医生值班”)。后来,当事务尝试提交时,原始数据可能已经发生变化 —— 前提条件可能不再为真。

When the application makes a query (e.g., “How many doctors are currently on call?”), the database doesn’t know how the application logic uses the result of that query. To be safe, the database needs to assume that any change in the query result (the premise) means that writes in that transaction may be invalid. In other words, there may be a causal dependency between the queries and the writes in the transaction. In order to provide serializable isolation, the database must detect situations in which a transaction may have acted on an outdated premise and abort the transaction in that case.
当应用程序发起查询(例如 “目前有多少位医生值班?”)时,数据库不知道应用程序逻辑如何使用该查询的结果。为了安全起见,数据库需要假设查询结果(前提条件)的任何变化都可能导致该事务中的写入操作无效。换句话说,查询和事务中的写入操作之间可能存在因果关系。为了提供可串行化隔离,数据库必须检测到事务可能基于过时的前提条件执行操作的情况,并在这种情况下中止事务。

How does the database know if a query result might have changed? There are two cases to consider:
数据库如何知道查询结果是否可能已更改?需要考虑两种情况:

  • Detecting reads of a stale MVCC object version (uncommitted write occurred before the read)
    检测到陈旧的 MVCC 对象版本的读取(读取操作之前发生了未提交的写入)
  • Detecting writes that affect prior reads (the write occurs after the read)
    检测影响先前读取的写入(写入发生在读取之后)

Detecting stale MVCC reads 检测陈旧的 MVCC 读取#

Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC; see “Multi-version concurrency control (MVCC)”). When a transaction reads from a consistent snapshot in an MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed at the time when the snapshot was taken.
回想一下,快照隔离通常是通过多版本并发控制(MVCC;参见 “多版本并发控制(MVCC)” 实现的)。当事务在 MVCC 数据库中从一致快照读取时,它会忽略在快照创建时尚未提交的任何其他事务所做的写入。

In Figure 8-10, transaction 43 sees Aaliyah as having on_call = true, because transaction 42 (which modified Aaliyah’s on-call status) is uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already committed. This means that the write that was ignored when reading from the consistent snapshot has now taken effect, and transaction 43’s premise is no longer true. Things get even more complicated when a writer inserts data that didn’t exist before (see “Phantoms causing write skew”). We’ll discuss detecting phantom writes for SSI in “Detecting writes that affect prior reads”.
在图 8-10 中,事务 43 将 Aaliyah 视为拥有 on_call = true ,因为事务 42(修改了 Aaliyah 的值班状态)尚未提交。然而,当事务 43 想要提交时,事务 42 已经提交了。这意味着在从一致性快照读取时被忽略的写入现在已经生效,事务 43 的前提不再成立。当写入者插入之前不存在的数据时(参见 “幻影导致写入偏差”),情况会更加复杂。我们将在 “检测影响先前读取的写入” 中讨论如何检测 SSI 的幻影写入。

ddia 0810

In order to prevent this anomaly, the database needs to track when a transaction ignores another transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted.
为了防止这种异常情况,数据库需要跟踪事务由于 MVCC(多版本并发控制)可见性规则而忽略其他事务的写操作的时间。当事务想要提交时,数据库检查是否有任何被忽略的写操作已经被提交。如果是这样,该事务必须被中止。

Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected? Well, if transaction 43 was a read-only transaction, it wouldn’t need to be aborted, because there is no risk of write skew. At the time when transaction 43 makes its read, the database doesn’t yet know whether that transaction is going to later perform a write. Moreover, transaction 42 may yet abort or may still be uncommitted at the time when transaction 43 is committed, and so the read may turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot isolation’s support for long-running reads from a consistent snapshot.
为什么非要等到提交时才处理?当检测到过时读取时,为什么不能立即中止事务 43?嗯,如果事务 43 是一个只读事务,那么它就不需要被中止,因为没有写偏差的风险。当事务 43 进行读取时,数据库还不知道该事务是否会之后执行写操作。此外,在事务 43 提交时,事务 42 可能已经中止,或者仍然未提交,因此这次读取可能根本不是过时的。通过避免不必要的中止,SSI(可串行化一致性隔离)保留了快照隔离对从一致性快照进行长时间读取的支持。

Detecting writes that affect prior reads 检测影响先前读取的写操作#

The second case to consider is when another transaction modifies data after it has been read. This case is illustrated in Figure 8-11.
需要考虑的第二个情况是,当另一个事务在数据被读取后对其进行修改时。这种情况如图 8-11 所示。

ddia 0811

In the context of two-phase locking we discussed index-range locks (see “Index-range locks”), which allow the database to lock access to all rows matching some search query, such as WHERE shift_id = 1234. We can use a similar technique here, except that SSI locks don’t block other transactions.
在两阶段锁定的上下文中,我们讨论了索引范围锁(参见 “索引范围锁”),它允许数据库锁定所有匹配某些搜索查询的行,例如 WHERE shift_id = 1234 。我们可以使用类似的技术,只是 SSI 锁不会阻塞其他事务。

In Figure 8-11, transactions 42 and 43 both search for on-call doctors during shift 1234. If there is an index on shift_id, the database can use the index entry 1234 to record the fact that transactions 42 and 43 read this data. (If there is no index, this information can be tracked at the table level.) This information only needs to be kept for a while: after a transaction has finished (committed or aborted), and all concurrent transactions have finished, the database can forget what data it read.
在图 8-11 中,事务 42 和 43 都在 1234 班次期间搜索值班医生。如果 shift_id 上有索引,数据库可以使用索引条目 1234 来记录事务 42 和 43 读取了这些数据。(如果没有索引,这些信息可以在表级别进行跟踪。)这些信息只需要保留一段时间:当事务完成(提交或中止)后,并且所有并发事务都完成后,数据库可以忘记它读取了哪些数据。

When a transaction writes to the database, it must look in the indexes for any other transactions that have recently read the affected data. This process is similar to acquiring a write lock on the affected key range, but rather than blocking until the readers have committed, the lock acts as a tripwire: it simply notifies the transactions that the data they read may no longer be up to date.
当事务写入数据库时,它必须检查索引中是否有其他事务最近读取了受影响的数据。这个过程类似于对受影响的关键字范围获取写锁,但与等待读取者提交后才阻塞不同,这个锁充当了一个触发器:它只是通知事务它们读取的数据可能不再是最新的。

In Figure 8-11, transaction 43 notifies transaction 42 that its prior read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect. However, when transaction 43 wants to commit, the conflicting write from 42 has already been committed, so 43 must abort.
在图 8-11 中,事务 43 通知事务 42 它的先前读取已经过时,反之亦然。事务 42 是第一个提交的,并且成功:尽管事务 43 的写入影响了 42,但 43 尚未提交,所以写入尚未生效。然而,当事务 43 想要提交时,来自 42 的冲突写入已经被提交,所以 43 必须中止。

Performance of serializable snapshot isolation 可串行化快照隔离的性能#

As always, many engineering details affect how well an algorithm works in practice. For example, one trade-off is the granularity at which transactions’ reads and writes are tracked. If the database keeps track of each transaction’s activity in great detail, it can be precise about which transactions need to abort, but the bookkeeping overhead can become significant. Less detailed tracking is faster, but may lead to more transactions being aborted than strictly necessary.
和往常一样,许多工程细节会影响算法在实际中的表现。例如,一个权衡点是跟踪事务读写操作的粒度。如果数据库详细跟踪每个事务的活动,它可以精确地确定哪些事务需要中止,但账目管理开销可能会变得很大。跟踪细节较少则更快,但可能会导致比严格必要的中止更多的事务。

In some cases, it’s okay for a transaction to read information that was overwritten by another transaction: depending on what else happened, it’s sometimes possible to prove that the result of the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of unnecessary aborts [14,55].
在某些情况下,允许事务读取被另一个事务覆盖的信息是可行的:根据其他发生的情况,有时可以证明执行结果仍然是可串行化的。PostgreSQL 使用这一理论来减少不必要的中止次数 [14, 55]。

Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot isolation, writers don’t block readers, and vice versa. This design principle makes query latency much more predictable and less variable. In particular, read-only queries can run on a consistent snapshot without requiring any locks, which is very appealing for read-heavy workloads.
与两阶段锁相比,可串行化快照隔离的主要优势在于一个事务不需要阻塞等待另一个事务持有的锁。与快照隔离一样,写操作不会阻塞读操作,反之亦然。这一设计原则使得查询延迟更加可预测且变化较小。特别是,只读查询可以在一致的快照上运行,而无需任何锁,这对于读密集型工作负载非常有吸引力。

Compared to serial execution, serializable snapshot isolation is not limited to the throughput of a single CPU core: for example, FoundationDB distributes the detection of serialization conflicts across multiple machines, allowing it to scale to very high throughput. Even though data may be sharded across multiple machines, transactions can read and write data in multiple shards while ensuring serializable isolation.
与串行执行相比,可串行化快照隔离不受单个 CPU 核心吞吐量的限制:例如,FoundationDB 将序列化冲突的检测分布到多台机器上,使其能够扩展到非常高的吞吐量。尽管数据可能跨多台机器分片,但事务可以在多个分片中读取和写入数据,同时确保可串行化隔离。

Compared to non-serializable snapshot isolation, the need to check for serializability violations introduces some performance overheads. How significant these overheads are is a matter of debate: some believe that serializability checking is not worth it [71], while others believe that the performance of serializability is now so good that there is no need to use the weaker snapshot isolation any more [68].
与非可串行化快照隔离相比,需要检查可串行化违规会引入一些性能开销。这些开销有多大是一个有争议的问题:一些人认为检查可串行化不值得 [71],而另一些人则认为可串行化的性能现在已经很好,不再需要使用较弱的快照隔离 [68]。

The rate of aborts significantly affects the overall performance of SSI. For example, a transaction that reads and writes data over a long period of time is likely to run into conflicts and abort, so SSI requires that read-write transactions be fairly short (long-running read-only transactions are okay). However, SSI is less sensitive to slow transactions than two-phase locking or serial execution.
中止率显著影响 SSI 的整体性能。例如,长时间读写数据的交易很可能发生冲突并中止,因此 SSI 要求读写交易相对较短(长时间只读交易是可以的)。然而,与两阶段锁定或串行执行相比,SSI 对缓慢交易不太敏感。

Distributed Transactions 分布式事务#

The last few sections have focused on concurrency control for isolation, the I in ACID. The algorithms we have seen apply to both single-node and distributed databases: although there are challenges in making concurrency control algorithms scalable (for example, performing distributed serializability checking for SSI), the high-level ideas for distributed concurrency control are similar to single-node concurrency control [8].
最后几节主要关注了隔离的并发控制,即 ACID 中的 I。我们所看到的算法适用于单节点和分布式数据库:尽管在使并发控制算法可扩展性方面存在挑战(例如,为 SSI 执行分布式可串行化检查),但分布式并发控制的高级思想与单节点并发控制相似 [8]。

Consistency and durability also don’t change much when we move to distributed transactions. However, atomicity requires more care.
当转向分布式事务时,一致性和持久性变化不大。然而,原子性需要更多关注。

For transactions that execute at a single database node, atomicity is commonly implemented by the storage engine. When the client asks the database node to commit the transaction, the database makes the transaction’s writes durable (typically in a write-ahead log; see “Making B-trees reliable”) and then appends a commit record to the log on disk. If the database crashes in the middle of this process, the transaction is recovered from the log when the node restarts: if the commit record was successfully written to disk before the crash, the transaction is considered committed; if not, any writes from that transaction are rolled back.
对于在单个数据库节点上执行的交易,原子性通常由存储引擎实现。当客户端请求数据库节点提交交易时,数据库使交易写入持久化(通常通过预写日志;参见 “使 B 树可靠”),然后在磁盘上的日志中追加一条提交记录。如果数据库在提交过程中崩溃,节点重启时将从日志中恢复交易:如果提交记录在崩溃前成功写入磁盘,则交易被视为已提交;如果没有,则该交易的所有写入将被回滚。

Thus, on a single node, transaction commitment crucially depends on the order in which data is durably written to disk: first the data, then the commit record [22]. The key deciding moment for whether the transaction commits or aborts is the moment at which the disk finishes writing the commit record: before that moment, it is still possible to abort (due to a crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it is a single device (the controller of one particular disk drive, attached to one particular node) that makes the commit atomic.
因此,在单个节点上,事务提交的关键取决于数据持久化写入磁盘的顺序:先写入数据,然后写入提交记录 [22]。决定事务是提交还是中止的关键时刻是磁盘完成写入提交记录的时刻:在此之前,仍然有可能中止(由于崩溃),但在此之后,事务即被提交(即使数据库崩溃)。因此,是单个设备(即连接到特定节点的特定磁盘驱动器的控制器)使得提交具有原子性。

However, what if multiple nodes are involved in a transaction? For example, perhaps you have a multi-object transaction in a sharded database, or a global secondary index (in which the index entry may be on a different node from the primary data; see “Sharding and Secondary Indexes”). Most “NoSQL” distributed datastores do not support such distributed transactions, but various distributed relational databases do.
然而,如果多个节点参与事务会怎样?例如,也许你有一个分片数据库中的多对象事务,或者是一个全局二级索引(其中索引条目可能位于与主数据不同的节点上;参见 “分片和二级索引”)。大多数 “NoSQL” 分布式数据存储不支持此类分布式事务,但各种分布式关系数据库则支持。

In these cases, it is not sufficient to simply send a commit request to all of the nodes and independently commit the transaction on each one. It could easily happen that the commit succeeds on some nodes and fails on other nodes, as shown in Figure 8-12:
在这种情况下,仅仅向所有节点发送提交请求并在每个节点上独立提交事务是不够的。如图 8-12 所示,很容易发生提交在某些节点上成功而在其他节点上失败的情况。

  • Some nodes may detect a constraint violation or conflict, making an abort necessary, while other nodes are successfully able to commit.
    有些节点可能会检测到约束违规或冲突,需要中止,而其他节点则成功提交。
  • Some of the commit requests might be lost in the network, eventually aborting due to a timeout, while other commit requests get through.
    部分提交请求可能会在网络中丢失,最终因超时而中止,而其他提交请求则成功通过。
  • Some nodes may crash before the commit record is fully written and roll back on recovery, while others successfully commit.
    有些节点可能在提交记录完全写入之前崩溃,并在恢复时回滚,而其他节点则成功提交。
    ddia 0812

If some nodes commit the transaction but others abort it, the nodes become inconsistent with each other. And once a transaction has been committed on one node, it cannot be retracted again if it later turns out that it was aborted on another node. This is because once data has been committed, it becomes visible to other transactions under read committed or stronger isolation. For example, in Figure 8-12, by the time user 1 notices that its commit failed on database 1, user 2 has already read the data from the same transaction on database 2. If user 1’s transaction was later aborted, user 2’s transaction would have to be reverted as well, since it was based on data that was retroactively declared not to have existed.
如果某些节点提交了交易而其他节点中止了它,这些节点会变得彼此不一致。并且一旦在某个节点上提交了交易,如果后来发现它在另一个节点上被中止了,就再也无法撤回。这是因为一旦数据被提交,它就会在可重复读或更强的隔离级别下对其他交易可见。例如,在图 8-12 中,当用户 1 发现它在数据库 1 上的提交失败时,用户 2 已经从数据库 2 的同一交易中读取了数据。如果用户 1 的交易后来被中止,用户 2 的交易也必须被撤销,因为它基于的数据被事后宣布不存在。

A better approach is to ensure that the nodes involved in a transaction either all commit or all abort, and to prevent a mixture of the two. Ensuring this is known as the atomic commitment problem.
更好的方法是确保参与事务的节点要么全部提交,要么全部中止,并防止两者混合。确保这一点被称为原子提交问题。

Two-Phase Commit (2PC) 两阶段提交(2PC)#

Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It is a classic algorithm in distributed databases [13,72,73]. 2PC is used internally in some databases and also made available to applications in the form of XA transactions [74] (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP web services [75,76].
两阶段提交是一种跨多个节点实现原子事务提交的算法。它是分布式数据库中的经典算法 [13, 72, 73]。2PC 在某些数据库内部使用,也以 XA 事务的形式提供给应用程序(例如,Java 事务 API 支持这些事务)或通过 WS-AtomicTransaction 为 SOAP 网络服务提供 [75, 76]。

The basic flow of 2PC is illustrated in Figure 8-13. Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).
2PC 的基本流程如图 8-13 所示。与单节点事务使用单个提交请求不同,2PC 的提交 / 中止过程被分为两个阶段(因此得名)。

ddia 0813

2PC uses a new component that does not normally appear in single-node transactions: a coordinator (also known as transaction manager). The coordinator is often implemented as a library within the same application process that is requesting the transaction (e.g., embedded in a Java EE container), but it can also be a separate process or service. Examples of such coordinators include Narayana, JOTM, BTM, or MSDTC.
2PC 使用了一个在单节点事务中通常不出现的组件:协调器(也称为事务管理器)。协调器通常作为请求事务的同一应用进程内的库实现(例如嵌入在 Java EE 容器中),但它也可以是单独的进程或服务。这类协调器的例子包括 Narayana、JOTM、BTM 或 MSDTC。

When 2PC is used, a distributed transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes participants in the transaction. When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit. The coordinator then tracks the responses from the participants:
当使用 2PC 时,分布式事务开始于应用程序在多个数据库节点上读取和写入数据,与正常情况相同。我们将这些数据库节点称为事务的参与者。当应用程序准备提交时,协调器开始第一阶段:它向每个节点发送一个准备请求,询问它们是否能够提交。然后协调器跟踪参与者的响应:

  • If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends out a commit request in phase 2, and the commit actually takes place.
    如果所有参与者都回复 “是”,表示他们准备提交,那么协调者在阶段 2 发送提交请求,并实际进行提交。
  • If any of the participants replies “no,” the coordinator sends an abort request to all nodes in phase 2.
    如果任何参与者回复 “否”,协调者在阶段 2 向所有节点发送中止请求。

This process is somewhat like the traditional marriage ceremony in Western cultures: the minister asks the bride and groom individually whether each wants to marry the other, and typically receives the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the couple husband and wife: the transaction is committed, and the happy fact is broadcast to all attendees. If either bride or groom does not say “yes,” the ceremony is aborted [77].
这个过程有点像西方文化中的传统婚礼仪式:牧师分别询问新娘和新郎是否愿意结婚,通常从两人那里都得到 “我愿意” 的答案。在收到双方的确认后,牧师宣布这对夫妇为夫妻:事务被提交,这个幸福的事实被广播给所有与会者。如果新娘或新郎任何一方没有说 “是”,仪式将被中止 [77]。

A system of promises 一个承诺系统#

From this short description it might not be clear why two-phase commit ensures atomicity, while one-phase commit across several nodes does not. Surely the prepare and commit requests can just as easily be lost in the two-phase case. What makes 2PC different?
从这段简短描述中可能无法清楚地理解为什么两阶段提交能确保原子性,而跨多个节点的单阶段提交却不能。当然,在两阶段提交的情况下,准备和提交请求同样可能丢失。是什么让 2PC 有所不同?

To understand why it works, we have to break down the process in a bit more detail:
要理解它为何有效,我们需要更详细地分解这个过程:

  1. When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.
    当应用程序想要开始一个分布式事务时,它会向协调器请求一个事务 ID。这个事务 ID 是全局唯一的。
  2. The application begins a single-node transaction on each of the participants, and attaches the globally unique transaction ID to the single-node transaction. All reads and writes are done in one of these single-node transactions. If anything goes wrong at this stage (for example, a node crashes or a request times out), the coordinator or any of the participants can abort.
    应用程序在每个参与者上开始一个单节点事务,并将全局唯一的事务 ID 附加到单节点事务上。所有读和写操作都在这些单节点事务中完成。如果在这一阶段出现任何问题(例如,某个节点崩溃或请求超时),协调器或任何参与者都可以中止。
  3. When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.
    当应用程序准备提交时,协调者向所有参与者发送带有全局事务 ID 的 prepare 请求。如果这些请求中的任何一个失败或超时,协调者会向所有参与者发送针对该事务 ID 的 abort 请求。
  4. When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances.This includes writing all transaction data to disk (a crash, a power failure, or running out of disk space is not an acceptable excuse for refusing to commit later), and checking for any conflicts or constraint violations. By replying “yes” to the coordinator, the node promises to commit the transaction without error if requested. In other words, the participant surrenders the right to abort the transaction, but without actually committing it.
    当参与者收到 prepare 请求时,它会确保在任何情况下都能确定性地提交事务。这包括将所有事务数据写入磁盘(崩溃、电源故障或磁盘空间不足都不能成为后来拒绝提交的借口),并检查任何冲突或约束违规。通过向协调者回复 “yes”,节点承诺如果被要求,将无错误地提交事务。换句话说,参与者放弃了中止事务的权利,但实际上并未提交它。
  5. When the coordinator has received responses to all prepare requests, it makes a definitive decision on whether to commit or abort the transaction (committing only if all participants voted “yes”). The coordinator must write that decision to its transaction log on disk so that it knows which way it decided in case it subsequently crashes. This is called the commit point.
    当协调者收到所有准备请求的响应后,它会就提交或中止事务做出最终决定(只有所有参与者都投票 “是” 时才提交)。协调者必须将其决定写入磁盘上的事务日志中,以便在随后崩溃时知道自己做出了何种决定。这被称为提交点。
  6. Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the transaction will be committed when it recovers—since the participant voted “yes,” it cannot refuse to commit when it recovers.
    一旦协调者的决定写入磁盘,提交或中止请求将被发送给所有参与者。如果此请求失败或超时,协调者必须无限重试直到成功。无法回头:如果决定是提交,必须强制执行该决定,无论需要多少次重试。如果在此期间参与者崩溃,事务将在其恢复时提交 —— 由于参与者投票 “是”,在恢复时无法拒绝提交。

Thus, the protocol contains two crucial “points of no return”: when a participant votes “yes,” it promises that it will definitely be able to commit later (although the coordinator may still choose to abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the atomicity of 2PC. (Single-node atomic commit lumps these two events into one: writing the commit record to the transaction log.)
因此,该协议包含两个关键的 “不可逆转点”:当一个参与者投票 “是” 时,它承诺之后一定能够提交(尽管协调者仍可能选择中止);一旦协调者做出决定,该决定就不可撤销。这些承诺确保了 2PC 的原子性。(单节点原子提交将这些两个事件合并为一个:将提交记录写入事务日志。)

Returning to the marriage analogy, before saying “I do,” you and your bride/groom have the freedom to abort the transaction by saying “No way!” (or something to that effect). However, after saying “I do,” you cannot retract that statement. If you faint after saying “I do” and you don’t hear the minister speak the words “You are now husband and wife,” that doesn’t change the fact that the transaction was committed. When you recover consciousness later, you can find out whether you are married or not by querying the minister for the status of your global transaction ID, or you can wait for the minister’s next retry of the commit request (since the retries will have continued throughout your period of unconsciousness).
回到婚姻的类比,在说 “我愿意” 之前,你和你的新娘 / 新郎都有自由通过说 “不行!”(或类似的话)来中止事务。然而,一旦说 “我愿意”,你就无法撤回该声明。如果你在说 “我愿意” 后晕倒,并且没有听到牧师说出 “你们现在结为夫妻” 的词,这并不会改变事务已经提交的事实。当你后来恢复意识时,可以通过向牧师查询你的全局事务 ID 的状态来得知自己是否已婚,或者可以等待牧师对提交请求的下次重试(因为重试将在你失去意识期间持续进行)。

Coordinator failure 协调者故障#

We have discussed what happens if one of the participants or the network fails during 2PC: if any of the prepare requests fails or times out, the coordinator aborts the transaction; if any of the commit or abort requests fails, the coordinator retries them indefinitely. However, it is less clear what happens if the coordinator crashes.
我们讨论了在 2PC 过程中,如果参与者或网络出现故障会发生什么:如果任何准备请求失败或超时,协调者会中止事务;如果任何提交或中止请求失败,协调者会无限重试。然而,如果协调者崩溃,情况就不那么清楚了。

If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain.
如果协调者在发送准备请求之前就失败了,参与者可以安全地中止事务。但是一旦参与者收到准备请求并投票 “是”,它就不能单方面中止 —— 它必须等待协调者的回复,以确定事务是被提交还是中止。如果此时协调者崩溃或网络失败,参与者只能等待。处于这种状态的参与者事务被称为悬而未决或不确定。

The situation is illustrated in Figure 8-14. In this particular example, the coordinator actually decided to commit, and database 2 received the commit request. However, the coordinator crashed before it could send the commit request to database 1, and so database 1 does not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly, it is not safe to unilaterally commit, because another participant may have aborted.
情况如图 8-14 所示。在这个特定示例中,协调者实际上决定提交,数据库 2 收到了提交请求。然而,协调者在向数据库 1 发送提交请求之前崩溃了,因此数据库 1 不知道是提交还是中止。即使超时也无法解决这个问题:如果数据库 1 在超时后单方面中止,最终会与已经提交的数据库 2 不一致。类似地,单方面提交也不安全,因为另一个参与者可能已经中止了。

ddia 0814

Without hearing from the coordinator, the participant has no way of knowing whether to commit or abort. In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol.
在没有收到协调者的消息的情况下,参与者无法知道是提交还是中止。原则上,参与者之间可以相互通信,以了解每个参与者投票的结果并达成某种共识,但这不属于 2PC 协议的一部分。

The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants: when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.
2PC 只能通过等待协调者恢复来完成。这就是为什么协调者必须在向参与者发送提交或中止请求之前,将其提交或中止决策写入磁盘上的事务日志:当协调者恢复时,它通过读取其事务日志来确定所有悬而未决事务的状态。协调者日志中没有任何提交记录的事务将被中止。因此,2PC 的提交点归结为协调者上的常规单节点原子提交。

Three-phase commit 三阶段提交#

Two-phase commit is called a blocking atomic commit protocol due to the fact that 2PC can become stuck waiting for the coordinator to recover. It is possible to make an atomic commit protocol nonblocking, so that it does not get stuck if a node fails. However, making this work in practice is not so straightforward.
由于 2PC 可能会因等待协调者恢复而卡住,因此两阶段提交被称为阻塞原子提交协议。可以将原子提交协议设计为非阻塞的,这样即使节点发生故障也不会卡住。然而,在实际中实现这一点并不那么简单。

As an alternative to 2PC, an algorithm called three-phase commit (3PC) has been proposed [13,78]. However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most practical systems with unbounded network delay and process pauses (see Chapter 9), it cannot guarantee atomicity.
作为 2PC 的替代方案,已经提出了一个称为三阶段提交(3PC)的算法 [13, 78]。然而,3PC 假设一个具有有界延迟的网络和具有有界响应时间的节点;在大多数具有无界网络延迟和处理暂停的实际系统中(见第 9 章),它无法保证原子性。

A better solution in practice is to replace the single-node coordinator with a fault-tolerant consensus protocol. We will see how to do this in Chapter 10.
实践中更好的解决方案是用容错共识协议替换单节点协调器。我们将在第 10 章中看到如何做到这一点。

Distributed Transactions Across Different Systems 跨不同系统的分布式事务#

Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are seen as providing an important safety guarantee that would be hard to achieve otherwise; on the other hand, they are criticized for causing operational problems, killing performance, and promising more than they can deliver [79,80,81,82]. Many cloud services choose not to implement distributed transactions due to the operational problems they engender [83].
分布式事务和两阶段提交有着复杂的声誉。一方面,它们被视为提供了一种重要的安全保证,否则很难实现;另一方面,它们因引发操作问题、降低性能以及承诺超出实际能力而受到批评 [79, 80, 81, 82]。许多云服务由于它们引发的运营问题而选择不实现分布式事务 [83]。

Some implementations of distributed transactions carry a heavy performance penalty. Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (fsync) that is required for crash recovery, and the additional network round-trips.
一些分布式事务的实现会带来严重的性能损失。两阶段提交中固有的大部分性能成本是由于需要额外的磁盘强制执行 ( fsync ) 来进行崩溃恢复以及额外的网络往返。

However, rather than dismissing distributed transactions outright, we should examine them in some more detail, because there are important lessons to be learned from them. To begin, we should be precise about what we mean by “distributed transactions.” Two quite different types of distributed transactions are often conflated:
然而,我们不应完全否定分布式事务,而应更详细地研究它们,因为从中可以学到重要的经验教训。首先,我们应该明确 “分布式事务” 的含义。通常会将两种截然不同的分布式事务类型混淆:

Database-internal distributed transactions
数据库内部分布式事务

Some distributed databases (i.e., databases that use replication and sharding in their standard configuration) support internal transactions among the nodes of that database. For example, YugabyteDB, TiDB, FoundationDB, Spanner, VoltDB, and MySQL Cluster’s NDB storage engine have such internal transaction support. In this case, all the nodes participating in the transaction are running the same database software.
一些分布式数据库(即在其标准配置中使用复制和分片的数据库)支持数据库节点之间的内部事务。例如,YugabyteDB、TiDB、FoundationDB、Spanner、VoltDB 和 MySQL Cluster 的 NDB 存储引擎都支持此类内部事务。在这种情况下,参与事务的所有节点都在运行相同的数据库软件。

Heterogeneous distributed transactions
异构分布式事务

In a heterogeneous transaction, the participants are two or more different technologies: for example, two databases from different vendors, or even non-database systems such as message brokers. A distributed transaction across these systems must ensure atomic commit, even though the systems may be entirely different under the hood.
在异构事务中,参与者是两种或多种不同的技术:例如,来自不同供应商的两个数据库,甚至非数据库系统(如消息代理)。跨这些系统的分布式事务必须确保原子性提交,尽管这些系统在底层可能完全不同。

Database-internal transactions do not have to be compatible with any other system, so they can use any protocol and apply optimizations specific to that particular technology. For that reason, database-internal distributed transactions can often work quite well. On the other hand, transactions spanning heterogeneous technologies are a lot more challenging.
数据库内部事务无需与其他系统兼容,因此可以使用任何协议并应用特定于该技术的优化。正因如此,数据库内部的分布式事务通常可以运行得相当好。另一方面,跨越异构技术的交易则具有更大的挑战性。

Exactly-once message processing 精确一次消息处理#

Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For example, a message from a message queue can be acknowledged as processed if and only if the database transaction for processing the message was successfully committed. This is implemented by atomically committing the message acknowledgment and the database writes in a single transaction. With distributed transaction support, this is possible, even if the message broker and the database are two unrelated technologies running on different machines.
异构分布式事务允许以强大的方式集成不同的系统。例如,只有当处理消息的数据库事务成功提交时,来自消息队列的消息才会被确认已处理。这是通过在一个事务中原子性地提交消息确认和数据库写入来实现的。有了分布式事务支持,即使消息代理和数据库是两种无关的技术,运行在不同的机器上,这也是可能的。

If either the message delivery or the database transaction fails, both are aborted, and so the message broker may safely redeliver the message later. Thus, by atomically committing the message and the side effects of its processing, we can ensure that the message is effectively processed exactly once, even if it required a few retries before it succeeded. The abort discards any side effects of the partially completed transaction. This is known as exactly-once semantics.
如果消息传递或数据库事务失败,两者都会中止,因此消息代理可以安全地稍后重新传递消息。通过原子性地提交消息及其处理产生的副作用,我们可以确保消息被有效且仅处理一次,即使它需要几次重试才能成功。中止会丢弃部分完成的事务的任何副作用。这被称为精确一次语义。

Such a distributed transaction is only possible if all systems affected by the transaction are able to use the same atomic commit protocol, however. For example, say a side effect of processing a message is to send an email, and the email server does not support two-phase commit: it could happen that the email is sent two or more times if message processing fails and is retried. But if all side effects of processing a message are rolled back on transaction abort, then the processing step can safely be retried as if nothing had happened.
然而,这种分布式事务只有在受事务影响的所有系统都能使用相同的原子提交协议时才可能实现。例如,假设处理消息的一个副作用是发送电子邮件,而电子邮件服务器不支持两阶段提交:如果消息处理失败并重试,电子邮件可能会发送两次或更多次。但如果在事务中止时回滚处理消息的所有副作用,那么处理步骤可以安全地重试,就好像什么都没发生过一样。

We will return to the topic of exactly-once semantics later in this chapter. Let’s look first at the atomic commit protocol that allows such heterogeneous distributed transactions.
我们将在本章稍后回到精确一次语义的主题。首先,让我们看看允许这种异构分布式事务的原子提交协议。

XA transactions XA 事务#

X/Open XA (short for eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies [74]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, Db2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).
X / Open XA(全称 eXtended Architecture)是一个跨异构技术实现两阶段提交的标准 [74]。该标准于 1991 年推出,已被广泛实现:XA 被许多传统关系型数据库(包括 PostgreSQL、MySQL、Db2、SQL Server 和 Oracle)和消息代理(包括 ActiveMQ、HornetQ、MSMQ 和 IBM MQ)所支持。

XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator. Bindings for this API exist in other languages; for example, in the world of Java EE applications, XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers using the Java Message Service (JMS) APIs.
XA 不是一个网络协议,它仅仅是一个用于与事务协调器交互的 C API。这个 API 在其他语言中也有绑定;例如,在 Java EE 应用程序的世界中,XA 事务是通过 Java 事务 API(JTA)实现的,而 JTA 又得到了许多使用 Java 数据库连接(JDBC)的数据库驱动程序和使用 Java 消息服务(JMS)API 的消息代理驱动程序的支撑。

XA assumes that your application uses a network driver or client library to communicate with the participant databases or messaging services. If the driver supports XA, that means it calls the XA API to find out whether an operation should be part of a distributed transaction—and if so, it sends the necessary information to the database server. The driver also exposes callbacks through which the coordinator can ask the participant to prepare, commit, or abort.
XA 假设您的应用程序使用网络驱动程序或客户端库与参与数据库或消息服务进行通信。如果驱动程序支持 XA,这意味着它会调用 XA API 来确定某个操作是否应成为分布式事务的一部分 —— 如果是,它会将必要信息发送到数据库服务器。驱动程序还通过回调暴露接口,协调器可以通过这些接口要求参与者准备、提交或中止。

The transaction coordinator implements the XA API. The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service). It keeps track of the participants in a transaction, collects partipants’ responses after asking them to prepare (via a callback into the driver), and uses a log on the local disk to keep track of the commit/abort decision for each transaction.
事务协调器实现了 XA API。标准并未规定其实现方式,但在实践中,协调器通常只是一个加载到发出事务的应用程序相同进程中的库(而不是一个独立的服务)。它跟踪事务的参与者,在请求参与者准备(通过驱动程序中的回调)后收集参与者的响应,并使用本地磁盘上的日志来跟踪每个事务的提交 / 中止决策。

If the application process crashes, or the machine on which the application is running dies, the coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be restarted, and the coordinator library must read the log to recover the commit/abort outcome of each transaction. Only then can the coordinator use the database driver’s XA callbacks to ask participants to commit or abort, as appropriate. The database server cannot contact the coordinator directly, since all communication must go via its client library.
如果应用程序进程崩溃,或者运行应用程序的机器死亡,协调者也会随之消失。此时,任何持有已准备但未提交事务的参与者将陷入不确定状态。由于协调者的日志存储在应用程序服务器的本地磁盘上,该服务器必须重启,并且协调者库必须读取日志以恢复每个事务的提交 / 中止结果。只有到那时,协调者才能使用数据库驱动的 XA 回调函数,根据情况要求参与者提交或中止。数据库服务器不能直接联系协调者,因为所有通信都必须通过其客户端库进行。

Holding locks while in doubt 处于不确定状态时持有锁#

Why do we care so much about a transaction being stuck in doubt? Can’t the rest of the system just get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually?
我们为什么如此关注事务陷入不确定状态?难道其他系统组件不能继续工作,忽略最终会被清理的不确定事务吗?

The problem is with locking. As discussed in “Read Committed”, database transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty writes. In addition, if you want serializable isolation, a database using two-phase locking would also have to take a shared lock on any rows read by the transaction.
问题是锁定。正如 “已提交” 章节中讨论的,数据库事务通常会对它们修改的任何行加行级排他锁,以防止脏写。此外,如果你想要可串行化隔离级别,使用两阶段锁定的数据库也必须对事务读取的任何行加共享锁。

The database cannot release those locks until the transaction commits or aborts (illustrated as a shaded area in Figure 8-13). Therefore, when using two-phase commit, a transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least until the situation is manually resolved by an administrator.
数据库只有在事务提交或中止时才能释放这些锁(在图 8-13 中以阴影区域表示)。因此,在使用两阶段提交时,事务必须在不确定期间一直持有这些锁。如果协调者崩溃并且需要 20 分钟才能重新启动,这些锁将被持有 20 分钟。如果由于某种原因协调者的日志完全丢失,这些锁将被永远持有 —— 或者至少要等到管理员手动解决情况为止。

While those locks are held, no other transaction can modify those rows. Depending on the isolation level, other transactions may even be blocked from reading those rows. Thus, other transactions cannot simply continue with their business—if they want to access that same data, they will be blocked. This can cause large parts of your application to become unavailable until the in-doubt transaction is resolved.
当这些锁被持有时,其他事务无法修改这些行。根据隔离级别,其他事务甚至可能被阻止读取这些行。因此,其他事务不能简单地继续其业务 —— 如果它们想要访问相同的数据,它们将被阻止。这可能导致应用程序的大部分功能变得不可用,直到悬而未决的事务得到解决。

Recovering from coordinator failure 从协调器故障中恢复#

In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the log and resolve any in-doubt transactions. However, in practice, orphaned in-doubt transactions do occur [84,85]—that is, transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because the transaction log has been lost or corrupted due to a software bug). These transactions cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions.
理论上,如果协调器崩溃并重新启动,它应该能够从日志中干净地恢复其状态并解决任何悬而未决的事务。然而,在实践中,确实会发生孤儿悬而未决的事务 [84, 85]—— 即由于某种原因(例如,由于软件错误导致事务日志丢失或损坏),协调器无法决定其结果的交易。这些事务无法自动解决,因此它们将永远存在于数据库中,持有锁并阻止其他事务。

Even rebooting your database servers will not fix this problem, since a correct implementation of 2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk violating the atomicity guarantee). It’s a sticky situation.
即使重启你的数据库服务器也无法解决这个问题,因为正确实现的 2PC 必须确保即使在重启后,悬而未决的事务的锁也能得到保留(否则它将面临违反原子性保证的风险)。这是一个棘手的情况。

The only way out is for an administrator to manually decide whether to commit or roll back the transactions. The administrator must examine the participants of each in-doubt transaction, determine whether any participant has committed or aborted already, and then apply the same outcome to the other participants. Resolving the problem potentially requires a lot of manual effort, and most likely needs to be done under high stress and time pressure during a serious production outage (otherwise, why would the coordinator be in such a bad state?).
唯一的解决办法是管理员手动决定是否提交或回滚事务。管理员必须检查每个悬而未决事务的参与者,确定是否有参与者已经提交或中止,然后将相同的结果应用于其他参与者。解决这个问题可能需要大量的手动工作,而且很可能需要在严重的生产故障期间,在高压和时间紧迫的情况下进行(否则,协调者怎么会处于如此糟糕的状态呢?)。

Many XA implementations have an emergency escape hatch called heuristic decisions: allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator [74]. To be clear,heuristic here is a euphemism for probably breaking atomicity, since the heuristic decision violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for getting out of catastrophic situations, and not for regular use.
许多 XA 实现都有一个紧急逃生通道,称为启发式决策:允许参与者单方面决定中止或提交一个悬而未决的事务,而无需协调者的最终决定 [74]。要说明的是,这里的启发式是一个委婉的说法,可能破坏原子性,因为启发式决策违反了二阶段提交中的承诺系统。因此,启发式决策仅用于摆脱灾难性情况,而不是日常使用。

Problems with XA transactionsXA 事务的问题#

A single-node coordinator is a single point of failure for the entire system, and making it part of the application server is also problematic because the coordinator’s logs on its local disk become a crucial part of the durable system state—as important as the databases themselves.
单节点协调器是整个系统的单点故障,将其作为应用服务器的一部分也存在问题,因为协调器在本地磁盘上的日志成为持久系统状态的关键部分 —— 其重要性不亚于数据库本身。

In principle, the coordinator of an XA transaction could be highly available and replicated, just like we would expect of any other important database. Unfortunately, this still doesn’t solve a fundamental problem with XA, which is that it provides no way for the coordinator and the participants of a transaction to communicate with each other directly. They can only communicate via the application code that invoked the transaction, and the database drivers through which it calls the participants.
原则上,XA 事务的协调器可以是高度可用和可复制的,就像我们期望任何其他重要数据库一样。不幸的是,这仍然无法解决 XA 的一个根本问题,即它没有为协调器和事务参与者之间提供直接通信的方式。他们只能通过调用事务的应用代码以及通过该代码调用参与者的数据库驱动程序进行通信。

Even if the coordinator were replicated, the application code would therefore be a single point of failure. Solving this problem would require totally redesigning how application code is run to make it replicated or restartable, which could perhaps look similar to durable execution (see “Durable Execution and Workflows”). However, there don’t seem to be any tools that actually take this approach in practice.
即使协调器是可复制的,应用代码因此将是一个单点故障。解决这个问题需要完全重新设计应用代码的运行方式,使其可复制或可重启,这可能类似于持久执行(参见 “持久执行和工作流”)。然而,似乎在实践中并没有任何工具实际采用这种方法。

Another problem is that since XA needs to be compatible with a wide range of data systems, it is necessarily a lowest common denominator. For example, it cannot detect deadlocks across different systems (since that would require a standardized protocol for systems to exchange information on the locks that each transaction is waiting for), and it does not work with SSI (see “Serializable Snapshot Isolation (SSI)”), since that would require a protocol for identifying conflicts across different systems.
另一个问题是,由于 XA 需要兼容各种数据系统,它必然是一个最低共同标准。例如,它无法检测跨系统的死锁(因为这需要系统间交换各事务等待的锁信息的标准化协议),并且它不适用于 SSI(参见 “可串行化快照隔离(SSI)”,因为这需要跨系统识别冲突的协议)。

These problems are somewhat inherent in performing transactions across heterogeneous technologies. However, keeping several heterogeneous data systems consistent with each other is still a real and important problem, so we need to find a different solution to it. This can be done, as we will see in the next section and in Chapter 12.
这些问题在跨异构技术执行事务时多少是固有的。然而,保持多个异构数据系统之间的一致性仍然是一个真实且重要的问题,因此我们需要找到不同的解决方案。这可以在下一节和第 12 章中看到。

Database-internal Distributed Transactions 数据库内部分布式事务#

As explained previously, there is a big difference between distributed transactions that span multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all the participating nodes are shards of the same database running the same software. Such internal distributed transactions are a defining feature of “NewSQL” databases such as CockroachDB [5], TiDB [6], Spanner [7], FoundationDB [8], and YugabyteDB, for example. Some message brokers such as Kafka also support internal distributed transactions [86].
如前所述,跨越多种异构存储技术的分布式事务与系统内部事务之间存在巨大差异 —— 即所有参与节点都是运行相同软件的同一数据库的片段。这类内部分布式事务是 “NewSQL” 数据库(如 CockroachDB [5]、TiDB [6]、Spanner [7]、FoundationDB [8] 和 YugabyteDB 等)的一个定义性特征。一些消息代理(如 Kafka)也支持内部分布式事务 [86]。

Many of these systems use 2-phase commit to ensure atomicity of transactions that write to multiple shards, and yet they don’t suffer the same problems as XA transactions. The reason is that because their distributed transactions don’t need to interface with any other technologies, they avoid the lowest-common-denominator trap—the designers of these systems are free to use better protocols that are more reliable and faster.
这些系统中许多使用两阶段提交来确保写入多个片段的事务的原子性,但它们并未遭受与 XA 事务相同的问题。原因是它们的分布式事务无需与其他技术交互,因此避免了最低限度陷阱 —— 这些系统的设计者可以自由使用更可靠、更快的协议。

The biggest problems with XA can be fixed by:
XA 最大的问题可以通过:

  • Replicating the coordinator, with automatic failover to another coordinator node if the primary one crashes;
    复制协调器,并在主协调器崩溃时自动切换到另一个协调器节点;
  • Allowing the coordinator and data shards to communicate directly without going via application code;
    允许协调器和数据分片直接通信,无需通过应用程序代码;
  • Replicating the participating shards, so that the risk of having to abort a transaction because of a fault in one of the shards is reduced; and
    复制参与的分片,以降低因某个分片出现故障而需要中止事务的风险;以及
  • Coupling the atomic commitment protocol with a distributed concurrency control protocol that supports deadlock detection and consistent reads across shards.
    将原子提交协议与支持死锁检测和跨分片一致性读取的分布式并发控制协议相结合。

Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will see in Chapter 10 how atomic commitment for distributed transactions can be implemented using a consensus algorithm. These algorithms tolerate faults by automatically failing over from one node to another without any human intervention, and while continuing to guarantee strong consistency properties.
共识算法通常用于复制协调器和数据库分片。在第 10 章中,我们将看到如何使用共识算法实现分布式事务的原子提交。这些算法通过自动从一台节点切换到另一台节点来容忍故障,无需人工干预,同时继续保证强一致性特性。

The isolation levels offered for distributed transactions depend on the system, but snapshot isolation and serializable snapshot isolation are both possible across shards. The details of how this works can be found in the papers referenced at the end of this chapter.
分布式事务提供的隔离级别取决于系统,但快照隔离和可序列化快照隔离都是跨分片可能的。这种工作原理的详细信息可以在本章末尾引用的论文中找到。

Exactly-once message processing revisited 精确一次消息处理再探讨#

We saw in “Exactly-once message processing” that an important use case for distributed transactions is to ensure that some operation takes effect exactly once, even if a crash occurs while it is being processed and the processing needs to be retried. If you can atomically commit a transaction across a message broker and a database, you can acknowledge the message to the broker if and only if it was successfully processed and the database writes resulting from the process were committed.
我们在 “精确一次消息处理” 中看到,分布式事务的一个重要用例是确保某些操作精确执行一次,即使处理过程中发生崩溃需要重试。如果你能跨消息代理和数据库原子提交事务,那么只有当消息成功处理且由处理产生的数据库写入已提交时,你才能向代理确认消息。

However, you don’t actually need such distributed transactions to achieve exactly-once semantics. An alternative approach is as follows, which only requires transactions within the database:
然而,实际上你并不需要这种分布式事务来实现精确一次语义。一种替代方法是如下所述,这只需要数据库内部的事务:

  1. Assume every message has a unique ID, and in the database you have a table of message IDs that have been processed. When you start processing a message from the broker, you begin a new transaction on the database, and check the message ID. If the same message ID is already present in the database, you know that it has already been processed, so you can acknowledge the message to the broker and drop it.
    假设每条消息都有一个唯一 ID,在数据库中你有一个已处理消息 ID 的表。当你从代理开始处理消息时,你在数据库上启动一个新事务,并检查消息 ID。如果数据库中已存在相同的消息 ID,你就知道它已经被处理过,因此可以向代理确认消息并丢弃它。
  2. If the message ID is not already in the database, you add it to the table. You then process the message, which may result in additional writes to the database within the same transaction. When you finish processing the message, you commit the transaction on the database.
    如果消息 ID 尚未存在于数据库中,则将其添加到表中。然后处理该消息,这可能导致在同一事务内对数据库进行额外写入。当您完成消息处理后,您提交数据库的事务。
  3. Once the database transaction is successfully committed, you can acknowledge the message to the broker.
    一旦数据库事务成功提交,您可以将消息确认给代理。
  4. Once the message has successfully been acknowledged to the broker, you know that it won’t try processing the same message again, so you can delete the message ID from the database (in a separate transaction).
    一旦消息成功确认给代理,您就清楚它不会再尝试处理相同消息,因此可以从数据库中删除消息 ID(在一个单独的事务中)。

If the message processor crashes before committing the database transaction, the transaction is aborted and the message broker will retry processing. If it crashes after committing but before acknowledging the message to the broker, it will also retry processing, but the retry will see the message ID in the database and drop it. If it crashes after acknowledging the message but before deleting the message ID from the database, you will have an old message ID lying around, which doesn’t do any harm besides taking a little bit of storage space. If a retry happens before the database transaction is aborted (which could happen if communication between the message processor and the database is interrupted), a uniqueness constraint on the table of message IDs should prevent the same message ID from being inserted by two concurrent transactions.
如果消息处理器在提交数据库事务之前崩溃,事务将被中止,消息代理将重试处理。如果它在提交后但在向代理确认消息之前崩溃,它也会重试处理,但重试时会看到数据库中的消息 ID 并将其丢弃。如果它在确认消息后但在从数据库删除消息 ID 之前崩溃,你会在周围留下一个旧的消息 ID,除了占用一点存储空间外不会造成任何伤害。如果在数据库事务中止之前发生重试(这可能是由于消息处理器和数据库之间的通信中断导致的),表中的消息 ID 的唯一性约束应该防止两个并发事务插入相同的消息 ID。

Thus, achieving exactly-once processing only requires transactions within the database—atomicity across database and message broker is not necessary for this use case. Recording the message ID in the database makes the message processing idempotent, so that message processing can be safely retried without duplicating its side-effects. A similar approach is used in stream processing frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in Chapter 12.
因此,要实现精确一次处理,仅需数据库内的事务 —— 在此用例中,数据库与消息代理之间的原子性并非必要。将消息 ID 记录在数据库中,使得消息处理具有幂等性,从而可以在不重复其副作用的情况下安全地重试消息处理。在 Kafka Streams 等流处理框架中,我们将在第 12 章看到,也使用类似的方法来实现精确一次语义。

However, internal distributed transactions within the database are still useful for the scalability of patterns such as these: for example, they would allow the message IDs to be stored on one shard and the main data updated by the message processing to be stored on other shards, and to ensure atomicity of the transaction commit across those shards.
然而,数据库内部的分布式事务对于此类模式的可扩展性仍然有用:例如,它们允许将消息 ID 存储在一个分片上,而将消息处理更新后的主数据存储在其他分片上,并确保这些分片之间的事务提交具有原子性。

Summary 摘要#

Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults don’t exist. A large class of errors is reduced down to a simple transaction abort, and the application just needs to try again.
事务是一个抽象层,允许应用程序假装某些并发问题和某些硬件和软件故障不存在。一大类错误被简化为简单的交易中止,应用程序只需重试即可。

In this chapter we saw many examples of problems that transactions help prevent. Not all applications are susceptible to all those problems: an application with very simple access patterns, such as reading and writing only a single record, can probably manage without transactions. However, for more complex access patterns, transactions can hugely reduce the number of potential error cases you need to think about.
在本章中,我们看到了许多事务可以预防的问题示例。并非所有应用程序都容易受到所有这些问题的影响:具有非常简单访问模式的应用程序,例如只读取和写入单个记录,可能不需要事务也能管理。然而,对于更复杂的访问模式,事务可以大幅减少需要考虑的潜在错误情况。

Without transactions, various error scenarios (processes crashing, network interruptions, power outages, disk full, unexpected concurrency, etc.) mean that data can become inconsistent in various ways. For example, denormalized data can easily go out of sync with the source data. Without transactions, it becomes very difficult to reason about the effects that complex interacting accesses can have on the database.
没有事务,各种错误场景(进程崩溃、网络中断、停电、磁盘满、意外并发等)意味着数据可能以各种方式变得不一致。例如,非规范化数据很容易与源数据不同步。没有事务,很难推理复杂交互访问对数据库的影响。

In this chapter, we went particularly deep into the topic of concurrency control. We discussed several widely used isolation levels, in particular read committed, snapshot isolation (sometimes called repeatable read), and serializable. We characterized those isolation levels by discussing various examples of race conditions, summarized in Table 8-1:
在本章中,我们特别深入探讨了并发控制的主题。我们讨论了几个广泛使用的隔离级别,特别是读已提交、快照隔离(有时称为可重复读)和可序列化。我们通过讨论各种竞争条件示例来描述这些隔离级别,这些示例总结在表 8-1 中:

Isolation level 隔离级别Dirty reads 脏读Read skew 读倾斜Phantom reads 虚读Lost updates 丢失更新Write skew 写倾斜
Read uncommitted 未提交读✗ Possible ✗ 可能✗ Possible ✗ 可能✗ Possible ✗ 可能✗ Possible ✗ 可能✗ Possible ✗ 可能
Read committed 读已提交✓ Prevented ✓ 防止✗ Possible ✗ 可能的✗ Possible ✗ 可能的✗ Possible ✗ 可能的✗ Possible ✗ 可能的
Snapshot isolation 快照隔离✓ Prevented ✓ 防止✓ Prevented ✓ 防止✓ Prevented ✓ 防止? Depends? 取决于✗ Possible ✗ 可能
Serializable 可序列化✓ Prevented ✓ 防止✓ Prevented ✓ 防止了✓ Prevented ✓ 防止了✓ Prevented ✓ 防止了✓ Prevented ✓ 防止了

Dirty reads 脏读

One client reads another client’s writes before they have been committed. The read committed isolation level and stronger levels prevent dirty reads.
一个客户端在另一个客户端的写入提交之前读取了该写入。读已提交隔离级别和更严格的级别可以防止脏读。

Dirty writes 脏写

One client overwrites data that another client has written, but not yet committed. Almost all transaction implementations prevent dirty writes.
一个客户端覆盖了另一个客户端已写入但尚未提交的数据。几乎所有的事务实现都防止脏读。

Read skew 读倾斜

A client sees different parts of the database at different points in time. Some cases of read skew are also known as nonrepeatable reads. This issue is most commonly prevented with snapshot isolation, which allows a transaction to read from a consistent snapshot corresponding to one particular point in time. It is usually implemented with multi-version concurrency control (MVCC).
一个客户端在不同时间点看到数据库的不同部分。某些读倾斜的情况也被称为不可重复读。这个问题通常通过快照隔离来防止,快照隔离允许事务从一个对应于特定时间点的、一致的快照中读取。它通常通过多版本并发控制(MVCC)来实现。

Lost updates 丢失更新

Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write without incorporating its changes, so data is lost. Some implementations of snapshot isolation prevent this anomaly automatically, while others require a manual lock (SELECT FOR UPDATE).
两个客户端同时执行读 - 修改 - 写循环。其中一个覆盖了另一个的写操作而没有包含其更改,导致数据丢失。一些快照隔离的实现会自动防止这种异常,而另一些则需要手动锁定( SELECT FOR UPDATE )。

Write skew 写倾斜

A transaction reads something, makes a decision based on the value it saw, and writes the decision to the database. However, by the time the write is made, the premise of the decision is no longer true. Only serializable isolation prevents this anomaly.
一个事务读取某些内容,根据所看到的价值做出决策,并将决策写入数据库。然而,在写入操作完成时,决策的前提条件已经不再成立。只有可串行化隔离可以防止这种异常。

Phantom reads 虚读

A transaction reads objects that match some search condition. Another client makes a write that affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.
一个事务读取符合某些搜索条件的数据对象。另一个客户端执行一个写入操作,该操作会影响到该搜索的结果。快照隔离可以防止直接的虚读,但在写入倾斜的上下文中,虚读需要特殊处理,例如使用索引范围锁。

Weak isolation levels protect against some of those anomalies but leave you, the application developer, to handle others manually (e.g., using explicit locking). Only serializable isolation protects against all of these issues. We discussed three different approaches to implementing serializable transactions:
弱隔离级别可以防止某些异常,但其他异常需要应用程序开发人员手动处理(例如,使用显式锁)。只有可序列化隔离可以防止所有这些问题。我们讨论了三种实现可序列化事务的不同方法:

Literally executing transactions in a serial order
按字面顺序执行事务

If you can make each transaction very fast to execute (typically by using stored procedures), and the transaction throughput is low enough to process on a single CPU core or can be sharded, this is a simple and effective option.
如果你能让每个事务执行得非常快(通常通过使用存储过程),并且事务吞吐量足够低,可以在单个 CPU 核心上处理或可以进行分片,那么这是一个简单且有效的选项。

Two-phase locking 两阶段锁

For decades this has been the standard way of implementing serializability, but many applications avoid using it because of its poor performance.
几十年来,这一直是实现可串行化的标准方法,但由于性能较差,许多应用程序避免使用它。

Serializable snapshot isolation (SSI)
可串行化快照隔离(SSI)

A comparatively new algorithm that avoids most of the downsides of the previous approaches. It uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction wants to commit, it is checked, and it is aborted if the execution was not serializable.
一种相对较新的算法,避免了之前方法的大部分缺点。它采用乐观方法,允许事务在不阻塞的情况下进行。当事务想要提交时,会进行检查,如果执行不可串行化,则会被中止。

Finally, we examined how to achieve atomicity when a transaction is distributed across multiple nodes, using two-phase commit. If those nodes are all running the same database software, distributed transactions can work quite well, but across different storage technologies (using XA transactions), 2PC is problematic: it is very sensitive to faults in the coordinator and the application code driving the transaction, and it interacts poorly with concurrency control mechanisms. Fortunately, idempotence can ensure exactly-once semantics without requiring atomic commit across different storage technologies, and we will see more on this in later chapters.
最后,我们探讨了当事务跨多个节点分布时如何通过两阶段提交实现原子性。如果这些节点都运行相同的数据库软件,分布式事务可以运行得相当好,但在不同的存储技术(使用 XA 事务)之间,2PC 存在问题:它对协调器和驱动事务的应用代码的故障非常敏感,并且与并发控制机制交互不佳。幸运的是,幂等性可以在不要求跨不同存储技术实现原子提交的情况下确保精确一次语义,我们将在后面的章节中对此进行更多讨论。

The examples in this chapter used a relational data model. However, as discussed in “The need for multi-object transactions”, transactions are a valuable database feature, no matter which data model is used.
本章的示例使用了关系数据模型。然而,正如在 “需要多对象事务的必要性” 中讨论的那样,无论使用哪种数据模型,事务都是数据库的一个宝贵特性。

Footnotes 脚注#
References 参考文献#

[1] Steven J. Murdoch.What went wrong with Horizon: learning from the Post Office Trial. benthamsgaze.org, July 2021. Archived at perma.cc/CNM4-553F

[2] Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, Irving L. Traiger, Bradford W. Wade, and Robert A. Yost.A History and Evaluation of System R. Communications of the ACM, volume 24, issue 10, pages 632–646, October 1981.doi.1145/358769.358784
[2] Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, Irving L. Traiger, Bradford W. Wade, 和 Robert A. Yost. System R 的历史与评估. ACM 通讯, 卷 24, 第 10 期, 第 632-646 页, 1981 年 10 月. doi.1145/358769.358784

[3] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger.Granularity of Locks and Degrees of Consistency in a Shared Data Base. in Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also in Readings in Database Systems, 4th edition, edited by Joseph M. Hellerstein and Michael Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1
[3] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, 和 Irving L. Traiger. 共享数据库中的锁粒度和一致性程度. 在数据库管理系统建模:IFIP 工作会议论文集,由 G. M. Nijssen 编辑,第 364-394 页,Elsevier / North Holland 出版,1976 年. 也收录于《数据库系统阅读材料》,第 4 版,由 Joseph M. Hellerstein 和 Michael Stonebraker 编辑,MIT 出版社,2005 年. ISBN: 978-0-262-69314-1

[4] Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger.The Notions of Consistency and Predicate Locks in a Database System. Communications of the ACM, volume 19, issue 11, pages 624–633, November 1976.doi.1145/360363.360369
[4] Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, 和 Irving L. Traiger. 数据库系统中的持续性和谓词锁概念. ACM 通讯, 卷 19, 第 11 期, 页 624–633, 1976 年 11 月. doi.1145/360363.360369

[5] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis.CockroachDB: The Resilient Geo-Distributed SQL Database. At ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 1493–1509, June 2020.doi.1145/3318464.3386134
[5] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, 和 Peter Mattis. CockroachDB: 弹性地理分布式 SQL 数据库. 在 ACM SIGMOD 国际数据管理会议 (SIGMOD), 页 1493–1509, 2020 年 6 月. doi.1145/3318464.3386134

[6] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang.TiDB: a Raft-based HTAP database.Proceedings of the VLDB Endowment, volume 13, issue 12, pages 3072–3084.doi.14778/3415478.3415535
[6] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, 和 Xin Tang. TiDB: 基于 Raft 的高时序分析数据库. VLDB 基金会会议录, 卷 13, 第 12 期, 页 3072–3084. doi.14778/3415478.3415535

[7] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang.Spanner: Google’s Globally-Distributed Database. At 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.
[7] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, 和 Ruth Wang. Spanner: Google 的全球分布式数据库. 在第 10 届 USENIX 操作系统设计与实现 symposium (OSDI), 2012 年 10 月.

[8] Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav.FoundationDB: A Distributed Unbundled Transactional Key Value Store. At ACM International Conference on Management of Data (SIGMOD), June 2021.doi.1145/3448016.3457559
[8] Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, 和 Vishesh Yadav. FoundationDB: 一个分布式解耦事务键值存储. 在 ACM 国际数据管理会议 (SIGMOD), 2021 年 6 月. doi.1145/3448016.3457559

[9] Theo Härder and Andreas Reuter.Principles of Transaction-Oriented Database Recovery. ACM Computing Surveys, volume 15, issue 4, pages 287–317, December 1983. doi.1145/289.291
[9] Theo Härder 和 Andreas Reuter. 面向事务数据库恢复的原则. ACM 计算机调查, 卷 15, 期 4, 页 287–317, 1983 年 12 月. doi.1145/289.291

[10] Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica.HAT, not CAP: Towards Highly Available Transactions. At 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2013.
[10] Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, 和 Ion Stoica. HAT, 不是 CAP: 迈向高可用性事务. 在第 14 届 USENIX 热点操作系统研讨会 (HotOS), 2013 年 5 月.

[11] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier.Cluster-Based Scalable Network Services. At 16th ACM Symposium on Operating Systems Principles (SOSP), October 1997.doi.1145/268998.266662
[11] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, 和 Paul Gauthier. 基于集群的可扩展网络服务. 在第 16 届 ACM 操作系统原理研讨会 (SOSP), 1997 年 10 月. doi.1145/268998.266662

[12] Tony Andrews.Enforcing Complex Constraints in Oracle. tonyandrews.blogspot.co.uk, October 2004. Archived at archive.org
[12] Tony Andrews. 在 Oracle 中强制执行复杂约束. tonyandrews.blogspot.co.uk, 2004 年 10 月. 已归档于 archive.org

[13] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman.Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at microsoft.com.
[13] Philip A. Bernstein, Vassos Hadzilacos, 和 Nathan Goodman. 数据库系统的并发控制和恢复. Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, 可在线获取自 microsoft.com.

[14] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, Patrick O’Neil, and Dennis Shasha.Making Snapshot Isolation Serializable. ACM Transactions on Database Systems, volume 30, issue 2, pages 492–528, June 2005.doi.1145/1071610.1071615
[14] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, Patrick O’Neil, 和 Dennis Shasha. 使快照隔离可串行化. ACM 数据库系统学报, 卷 30, 第 2 期, 第 492–528 页, 2005 年 6 月. doi.1145/1071610.1071615

[15] Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge.Understanding the Robustness of SSDs Under Power Fault. At 11th USENIX Conference on File and Storage Technologies (FAST), February 2013.
[15] Mai Zheng, Joseph Tucek, Feng Qin, 和 Mark Lillibridge. 理解 SSD 在电源故障下的鲁棒性. 在第 11 届 USENIX 文件和存储技术会议 (FAST), 2013 年 2 月.

[16] Laurie Denness.SSDs: A Gift and a Curse.laur.ie, June 2015. Archived at perma.cc/6GLP-BX3T
[16] Laurie Denness. SSD:既是礼物也是诅咒. laur.ie, 2015 年 6 月. 已存档于 perma.cc / 6GLP - BX3T

[17] Adam Surak.When Solid State Drives Are Not That Solid. blog.algolia.com, June 2015. Archived at perma.cc/CBR9-QZEE
[17] Adam Surak. 当固态驱动器并不那么坚固. blog.algolia.com, 2015 年 6 月. 存档于 perma.cc/CBR9-QZEE

[18] Hewlett Packard Enterprise.Bulletin: (Revision) HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation.support.hpe.com, November 2019. Archived at perma.cc/CZR4-AQBS
[18] 惠普企业. 公告:(修订版)HPE SAS 固态驱动器 - 特定 HPE SAS 固态驱动器型号需进行关键固件升级以防止在运行 32,768 小时后驱动器故障. support.hpe.com, 2019 年 11 月. 存档于 perma.cc/CZR4-AQBS

[19] Craig Ringer et al.PostgreSQL’s handling of fsync() errors is unsafe and risks data loss at least on XFS. Email thread on pgsql-hackers mailing list, postgresql.org, March 2018. Archived at perma.cc/5RKU-57FL
[19] Craig Ringer 等人. PostgreSQL 处理 fsync () 错误的方式不安全,至少在 XFS 上会存在数据丢失风险. pgsql-hackers 邮件列表中的邮件讨论串, postgresql.org, 2018 年 3 月. 存档于 perma.cc / 5RKU - 57FL

[20] Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.Can Applications Recover from fsync Failures? At USENIX Annual Technical Conference (ATC), July 2020.
[20] Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau 和 Remzi H. Arpaci-Dusseau. 应用程序能否从 fsync 失败中恢复?在 USENIX 年度技术会议 (ATC), 2020 年 7 月.

[21] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.Crash Consistency: Rethinking the Fundamental Abstractions of the File System. ACM Queue, volume 13, issue 7, pages 20–28, July 2015.doi.1145/2800695.2801719
[21] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, 和 Remzi H. Arpaci-Dusseau. Crash Consistency: Rethinking the Fundamental Abstractions of the File System. ACM Queue, 第 13 卷, 第 7 期, 第 20-28 页, 2015 年 7 月. doi.1145/2800695.2801719

[22] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. At 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
[22] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, 和 Remzi H. Arpaci-Dusseau. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. 在第 11 届 USENIX 操作系统设计与实现 symposium (OSDI), 2014 年 10 月.

[23] Chris Siebenmann.Unix’s File Durability Problem. utcc.utoronto.ca, April 2016. Archived at perma.cc/VSS8-5MC4
[23] Chris Siebenmann. Unix’s File Durability Problem. utcc.utoronto.ca, 2016 年 4 月. 永久存档于 perma.cc/VSS8-5MC4

[24] Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. At 15th USENIX Conference on File and Storage Technologies (FAST), February 2017.
[24] Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, 和 Remzi H. Arpaci-Dusseau. Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. 在第 15 届 USENIX 文件与存储技术 conference (FAST), 2017 年 2 月.

[25] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.An Analysis of Data Corruption in the Storage Stack. At 6th USENIX Conference on File and Storage Technologies (FAST), February 2008.
[25] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau 和 Remzi H. Arpaci-Dusseau. 存储栈中数据损坏的分析. 在第 6 届 USENIX 文件和存储技术会议(FAST)上,2008 年 2 月.

[26] Richard van der Hoff.How we discovered, and recovered from, Postgres corruption on the matrix.org homeserver. matrix.org, July 2025. Archived at perma.cc/CDF5-NRBK
[26] Richard van der Hoff. 我们如何发现并从 matrix.org homeserver 的 Postgres 损坏中恢复过来. matrix.org, 2025 年 7 月. 永久存档于 perma.cc/CDF5-NRBK

[27] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant.Flash Reliability in Production: The Expected and the Unexpected. At 14th USENIX Conference on File and Storage Technologies (FAST), February 2016.
[27] Bianca Schroeder, Raghav Lagisetty 和 Arif Merchant. 生产中的闪存可靠性:预期与意外. 在第 14 届 USENIX 文件和存储技术会议(FAST)上,2016 年 2 月.

[28] Don Allison.SSD Storage – Ignorance of Technology Is No Excuse. blog.korelogic.com, March 2015. Archived at perma.cc/9QN4-9SNJ
[28] Don Allison. SSD 存储 —— 技术无知不是借口. blog.korelogic.com, 2015 年 3 月. 永久存档于 perma.cc / 9QN4 - 9SNJ

[29] Gordon Mah Ung.Debunked: Your SSD won’t lose data if left unplugged after all. pcworld.com, May 2015. Archived at perma.cc/S46H-JUDU
[29] Gordon Mah Ung. 《辟谣:即使拔掉电源,你的 SSD 也不会丢失数据》. pcworld.com, 2015 年 5 月. 永久存档于 perma.cc / S46H - JUDU

[30] Martin Kleppmann.Hermitage: Testing the ‘I’ in ACID. martin.kleppmann.com, November 2014. Archived at perma.cc/KP2Y-AQGK
[30] Martin Kleppmann. 《隐士:测试 ACID 中的 “I”》. martin.kleppmann.com, 2014 年 11 月. 永久存档于 perma.cc / KP2Y - AQGK

[31] Todd Warszawski and Peter Bailis.ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications. At ACM International Conference on Management of Data (SIGMOD), May 2017.doi.1145/3035918.3064037
[31] Todd Warszawski 和 Peter Bailis. 《ACIDRain:针对数据库支持型 Web 应用的并发相关攻击》. 在 ACM 国际数据管理会议(SIGMOD),2017 年 5 月. doi.1145/3035918.3064037

[32] Tristan D’Agosta.BTC Stolen from Poloniex.bitcointalk.org, March 2014. Archived at perma.cc/YHA6-4C5D
[32] Tristan D’Agosta. 《Poloniex 被盗的比特币》. bitcointalk.org, 2014 年 3 月. 永久存档于 perma.cc/YHA6-4C5D

[33] bitcointhief2.How I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More!reddit.com, February 2014. Archived at archive.org
[33] bitcointhief2. 我如何从交易所盗取约 100 个 BTC 以及我本可以盗取更多!reddit.com,2014 年 2 月。存档于 archive.org。

[34] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan.Automating the Detection of Snapshot Isolation Anomalies. At 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[34] Sudhir Jorwekar、Alan Fekete、Krithi Ramamritham 和 S. Sudarshan. 自动检测快照隔离异常。第 33 届国际大型数据库会议(VLDB),2007 年 9 月。

[35] Michael Melanson.Transactions: The Limits of Isolation. michaelmelanson.net, November 2014. Archived at perma.cc/RG5R-KMYZ
[35] Michael Melanson. 事务:隔离的极限。michaelmelanson.net,2014 年 11 月。存档于 perma.cc / RG5R - KMYZ。

[36] Edward Kim.How ACH works: A developer perspective — Part 1. engineering.gusto.com, April 2014. Archived at perma.cc/7B2H-PU94
[36] Edward Kim. ACH 工作原理:开发者的视角 —— 第一部分。engineering.gusto.com,2014 年 4 月。存档于 perma.cc / 7B2H - PU94。

[37] Hal Berenson, Philip A. Bernstein, Jim N. Gray, Jim Melton, Elizabeth O’Neil, and Patrick O’Neil.A Critique of ANSI SQL Isolation Levels. At ACM International Conference on Management of Data (SIGMOD), May 1995. doi.1145/568271.223785
[37] Hal Berenson, Philip A. Bernstein, Jim N. Gray, Jim Melton, Elizabeth O’Neil 和 Patrick O’Neil. 《ANSI SQL 隔离级别的批判》. 在 ACM 国际数据管理会议(SIGMOD),1995 年 5 月. doi.1145/568271.223785

[38] Atul Adya. Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions. PhD Thesis, Massachusetts Institute of Technology, March 1999. Archived at perma.cc/E97M-HW5Q
[38] Atul Adya. 《弱一致性:分布式事务的广义理论与乐观实现》. 博士论文,麻省理工学院,1999 年 3 月. 存档于 perma.cc / E97M - HW5Q

[39] Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica.Highly Available Transactions: Virtues and Limitations. At 40th International Conference on Very Large Data Bases (VLDB), September 2014.
[39] Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein 和 Ion Stoica. 《高可用性事务:优点与局限》. 在第 40 届国际超大型数据库会议(VLDB),2014 年 9 月.

[40] Natacha Crooks, Youer Pu, Lorenzo Alvisi, and Allen Clement.Seeing is Believing: A Client-Centric Specification of Database Isolation. At ACM Symposium on Principles of Distributed Computing (PODC), pages 73–82, July 2017.doi.1145/3087801.3087802
[40] Natacha Crooks, Youer Pu, Lorenzo Alvisi 和 Allen Clement. 《眼见为实:面向客户端的数据库隔离规范》. 在 ACM 分布式计算原理研讨会(PODC),第 73-82 页,2017 年 7 月. doi.1145/3087801.3087802

[41] Bruce Momjian.MVCC Unmasked. momjian.us, July 2014. Archived at perma.cc/KQ47-9GYB
[41] Bruce Momjian. MVCC Unmasked. momjian.us, 2014 年 7 月. 归档于 perma.cc/KQ47-9GYB

[42] Peter Alvaro and Kyle Kingsbury.MySQL 8.0.34. jepsen.io, December 2023. Archived at perma.cc/HGE2-Z878
[42] Peter Alvaro 和 Kyle Kingsbury. MySQL 8.0.34. jepsen.io, 2023 年 12 月. 归档于 perma.cc/HGE2-Z878

[43] Egor Rogov.PostgreSQL 14 Internals.postgrespro.com, April 2023. Archived at perma.cc/FRK2-D7WB
[43] Egor Rogov. PostgreSQL 14 Internals. postgrespro.com, 2023 年 4 月. 归档于 perma.cc/FRK2-D7WB

[44] Hironobu Suzuki.The Internals of PostgreSQL.interdb.jp, 2017.
[44] Hironobu Suzuki. The Internals of PostgreSQL. interdb.jp, 2017 年.

[45] Rohan Reddy Alleti.Internals of MVCC in Postgres: Hidden costs of Updates vs Inserts. medium.com, March 2025. Archived at perma.cc/3ACX-DFXT
[45] Rohan Reddy Alleti. Postgres 中 MVCC 的内部机制:更新与插入的隐藏成本。medium.com,2025 年 3 月。存档于 perma.cc / 3ACX - DFXT

[46] Andy Pavlo and Bohan Zhang.The Part of PostgreSQL We Hate the Most. cs.cmu.edu, April 2023. Archived at perma.cc/XSP6-3JBN
[46] Andy Pavlo 和 Bohan Zhang。我们最讨厌的 PostgreSQL 部分。cs.cmu.edu,2023 年 4 月。存档于 perma.cc/XSP6-3JBN

[47] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo.An empirical evaluation of in-memory multi-version concurrency control. Proceedings of the VLDB Endowment, volume 10, issue 7, pages 781–792, March 2017.doi.14778/3067421.3067427
[47] Yingjun Wu、Joy Arulraj、Jiexi Lin、Ran Xian 和 Andrew Pavlo。内存多版本并发控制的实证评估。《VLDB 会议论文集》,第 10 卷,第 7 期,第 781–792 页,2017 年 3 月。doi.14778/3067421.3067427

[48] Nikita Prokopov.Unofficial Guide to Datomic Internals. tonsky.me, May 2014.
[48] Nikita Prokopov。Datomic 内部机制非官方指南。tonsky.me,2014 年 5 月。

[49] Daniil Svetlov.A Practical Guide to Taming Postgres Isolation Anomalies. dansvetlov.me, March 2025. Archived at perma.cc/L7LE-TDLS
[49] Daniil Svetlov. 《驯服 Postgres 隔离异常的实用指南》. dansvetlov.me, 2025 年 3 月. 归档于 perma.cc / L7LE - TDLS

[50] Nate Wiger.An Atomic Rant. nateware.com, February 2010. Archived at perma.cc/5ZYB-PE44
[50] Nate Wiger. 《原子级抱怨》. nateware.com, 2010 年 2 月. 归档于 perma.cc / 5ZYB - PE44

[51] James Coglan.Reading and writing, part 3: web applications. blog.jcoglan.com, October 2020. Archived at perma.cc/A7EK-PJVS
[51] James Coglan. 《读写,第三部分:Web 应用程序》. blog.jcoglan.com, 2020 年 10 月. 归档于 perma.cc / A7EK - PJVS

[52] Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica.Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity. At ACM International Conference on Management of Data (SIGMOD), June 2015.doi.1145/2723372.2737784
[52] Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, 和 Ion Stoica. 《野性并发控制:现代应用程序完整性的实证研究》. 在 ACM 国际数据管理会议 (SIGMOD), 2015 年 6 月. doi.1145/2723372.2737784

[53] Jaana Dogan.Things I Wished More Developers Knew About Databases. rakyll.medium.com, April 2020. Archived at perma.cc/6EFK-P2TD
[53] Jaana Dogan. 我希望更多开发者了解数据库. rakyll.medium.com, 2020 年 4 月. 永久存档于 perma.cc / 6EFK - P2TD

[54] Michael J. Cahill, Uwe Röhm, and Alan Fekete.Serializable Isolation for Snapshot Databases. At ACM International Conference on Management of Data (SIGMOD), June 2008.doi.1145/1376616.1376690
[54] Michael J. Cahill, Uwe Röhm, 和 Alan Fekete. 快照数据库的可串行隔离. 在 ACM 国际数据管理会议 (SIGMOD), 2008 年 6 月. doi.1145/1376616.1376690

[55] Dan R. K. Ports and Kevin Grittner.Serializable Snapshot Isolation in PostgreSQL. At 38th International Conference on Very Large Databases (VLDB), August 2012.
[55] Dan R. K. Ports 和 Kevin Grittner. PostgreSQL 中的可序列化快照隔离. 在第 38 届 Very Large Databases (VLDB) 国际会议上,2012 年 8 月.

[56] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer and Carl H. Hauser.Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System. At 15th ACM Symposium on Operating Systems Principles (SOSP), December 1995.doi.1145/224056.224070
[56] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer 和 Carl H. Hauser. 在弱连接复制的存储系统 Bayou 中管理更新冲突. 在第 15 届 ACM 操作系统原理研讨会 (SOSP) 上,1995 年 12 月. doi.1145/224056.224070

[57] Hans-Jürgen Schönig.Constraints over multiple rows in PostgreSQL. cybertec-postgresql.com, June 2021. Archived at perma.cc/2TGH-XUPZ
[57] Hans-Jürgen Schönig. PostgreSQL 中的多行约束. cybertec-postgresql.com, 2021 年 6 月. 存档于 perma.cc / 2TGH - XUPZ

[58] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland.The End of an Architectural Era (It’s Time for a Complete Rewrite). At 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[58] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem 和 Pat Helland. 一个架构时代的终结(是时候进行全面重写了). 在第 33 届 Very Large Data Bases (VLDB) 国际会议上,2007 年 9 月.

[59] John Hugg.H-Store/VoltDB Architecture vs. CEP Systems and Newer Streaming Architectures. At Data @Scale Boston, November 2014.
[59] John Hugg. H-Store/VoltDB 架构与 CEP 系统及新型流式架构的比较. 在 Data @Scale 波士顿会议上发表,2014 年 11 月。

[60] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi.H-Store: A High-Performance, Distributed Main Memory Transaction Processing System. Proceedings of the VLDB Endowment, volume 1, issue 2, pages 1496–1499, August 2008.
[60] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, 和 Daniel J. Abadi. H-Store:一个高性能分布式内存事务处理系统. VLDB 基金会会议论文集,第 1 卷,第 2 期,第 1496-1499 页,2008 年 8 月。

[61] Rich Hickey.The Architecture of Datomic.infoq.com, November 2012. Archived at perma.cc/5YWU-8XJK
[61] Rich Hickey. Datomic 的架构. infoq.com,2012 年 11 月. 永久存档于 perma.cc / 5YWU - 8XJK。

[62] John Hugg.Debunking Myths About the VoltDB In-Memory Database. dzone.com, May 2014. Archived at perma.cc/2Z9N-HPKF
[62] John Hugg. 解构关于 VoltDB 内存数据库的迷思. dzone.com,2014 年 5 月. 永久存档于 perma.cc / 2Z9N - HPKF。

[63] Xinjing Zhou, Viktor Leis, Xiangyao Yu, and Michael Stonebraker.OLTP Through the Looking Glass 16 Years Later: Communication is the New Bottleneck. At 15th Annual Conference on Innovative Data Systems Research (CIDR), January 2025.
[63] 周新静,Leis 维克多,余祥耀,和 Michael Stonebraker。OLTP 穿越镜面 16 年后:通信是新的瓶颈。在第 15 届创新数据系统研究年会上(CIDR),2025 年 1 月。

[64] Xinjing Zhou, Xiangyao Yu, Goetz Graefe, and Michael Stonebraker.Lotus: scalable multi-partition transactions on single-threaded partitioned databases. Proceedings of the VLDB Endowment (PVLDB), volume 15, issue 11, pages 2939–2952, July 2022.doi.14778/3551793.3551843
[64] 周新静,余祥耀,Goetz Graefe,和 Michael Stonebraker。Lotus:单线程分区数据库上的可扩展多分区事务。VLDB 基金会会议录(PVLDB),第 15 卷,第 11 期,第 2939–2952 页,2022 年 7 月。doi.14778/3551793.3551843

[65] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton.Architecture of a Database System.Foundations and Trends in Databases, volume 1, issue 2, pages 141–259, November 2007.doi.1561/1900000002
[65] Joseph M. Hellerstein,Michael Stonebraker,和 James Hamilton。数据库系统架构。数据库基础与趋势,第 1 卷,第 2 期,第 141–259 页,2007 年 11 月。doi.1561/1900000002

[66] Michael J. Cahill.Serializable Isolation for Snapshot Databases. PhD Thesis, University of Sydney, July 2009. Archived at perma.cc/727J-NTMP
[66] Michael J. Cahill。快照数据库的可串行隔离。博士论文,悉尼大学,2009 年 7 月。存档于 perma.cc / 727J - NTMP

[67] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling.Hekaton: SQL Server’s Memory-Optimized OLTP Engine. At ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 1243–1254, June 2013.doi.1145/2463676.2463710
[67] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, 和 Mike Zwilling. Hekaton: SQL Server 的内存优化 OLTP 引擎. 在 ACM SIGMOD 国际数据管理会议 (SIGMOD), 第 1243–1254 页, 2013 年 6 月. doi.1145/2463676.2463710

[68] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper.Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. At ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 677–689, May 2015.doi.1145/2723372.2749436
[68] Thomas Neumann, Tobias Mühlbauer, 和 Alfons Kemper. 主存数据库系统的快速可串行化多版本并发控制. 在 ACM SIGMOD 国际数据管理会议 (SIGMOD), 第 677–689 页, 2015 年 5 月. doi.1145/2723372.2749436

[69] D. Z. Badal.Correctness of Concurrency Control and Implications in Distributed Databases. At 3rd International IEEE Computer Software and Applications Conference (COMPSAC), November 1979.doi.1109/CMPSAC.1979.762563
[69] D. Z. Badal. 并发控制的正确性及分布式数据库中的影响. 在第 3 届国际 IEEE 计算机软件与应用会议 (COMPSAC), 1979 年 11 月. doi.1109/CMPSAC.1979.762563

[70] Rakesh Agrawal, Michael J. Carey, and Miron Livny.Concurrency Control Performance Modeling: Alternatives and Implications. ACM Transactions on Database Systems (TODS), volume 12, issue 4, pages 609–654, December 1987.doi.1145/32204.32220
[70] Rakesh Agrawal, Michael J. Carey, 和 Miron Livny. 并发控制性能建模:替代方案及影响. ACM 数据库系统事务期刊 (TODS), 第 12 卷, 第 4 期, 第 609–654 页, 1987 年 12 月. doi.1145/32204.32220

[71] Marc Brooker.Snapshot Isolation vs Serializability. brooker.co.za, December 2024. Archived at perma.cc/5TRC-CR5G
[71] Marc Brooker. 快照隔离与可串行化. brooker.co.za, 2024 年 12 月. 永久存档于 perma.cc / 5TRC - CR5G

[72] B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade.Notes on Distributed Databases. IBM Research, Research Report RJ2571(33471), July 1979. Archived at perma.cc/EPZ3-MHDD
[72] B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, 和 B. W. Wade. 分布式数据库笔记. IBM Research, 研究报告 RJ2571 (33471), 1979 年 7 月. 永久存档于 perma.cc/EPZ3-MHDD

[73] C. Mohan, Bruce G. Lindsay, and Ron Obermarck.Transaction Management in the R* Distributed Database Management System.ACM Transactions on Database Systems, volume 11, issue 4, pages 378–396, December 1986.doi.1145/7239.7266
[73] C. Mohan, Bruce G. Lindsay, 和 Ron Obermarck. R* 分布式数据库管理系统中的事务管理. ACM 数据库系统汇刊, 第 11 卷, 第 4 期, 第 378–396 页, 1986 年 12 月. doi.1145/7239.7266

[74] X/Open Company Ltd.Distributed Transaction Processing: The XA Specification. Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3, archived at perma.cc/Z96H-29JB
[74] X/Open 公司有限公司. 分布式事务处理: XA 规范. 技术标准 XO/CAE/91/300, 1991 年 12 月. ISBN: 978-1-872-63024-3, 永久存档于 perma.cc/Z96H - 29JB

[75] Ivan Silva Neto and Francisco Reverbel.Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction. At 7th IEEE/ACIS International Conference on Computer and Information Science (ICIS), May 2008.doi.1109/ICIS.2008.75
[75] Ivan Silva Neto 和 Francisco Reverbel. 实施 WS-Coordination 和 WS-AtomicTransaction 的经验教训. 在第 7 届 IEEE / ACIS 国际计算机与信息科学会议 (ICIS) 上,2008 年 5 月. doi.1109/ICIS.2008.75

[76] James E. Johnson, David E. Langworthy, Leslie Lamport, and Friedrich H. Vogt.Formal Specification of a Web Services Protocol. At 1st International Workshop on Web Services and Formal Methods (WS-FM), February 2004.doi.1016/j.entcs.2004.02.022
[76] James E. Johnson, David E. Langworthy, Leslie Lamport 和 Friedrich H. Vogt. Web 服务协议的形式规范. 在第 1 届 Web 服务与形式方法国际研讨会 (WS - FM) 上,2004 年 2 月. doi.1016/j.entcs.2004.02.022

[77] Jim Gray.The Transaction Concept: Virtues and Limitations. At 7th International Conference on Very Large Data Bases (VLDB), September 1981.
[77] Jim Gray. 事务概念:优点与局限性. 在第 7 届 very large 数据库国际会议 (VLDB) 上,1981 年 9 月.

[78] Dale Skeen.Nonblocking Commit Protocols. At ACM International Conference on Management of Data (SIGMOD), April 1981.doi.1145/582318.582339
[78] Dale Skeen. 非阻塞提交协议. 在 ACM 国际数据管理会议 (SIGMOD) 上,1981 年 4 月. doi.1145/582318.582339

[79] Gregor Hohpe.Your Coffee Shop Doesn’t Use Two-Phase Commit. IEEE Software, volume 22, issue 2, pages 64–66, March 2005.doi.1109/MS.2005.52
[79] Gregor Hohpe. 你的咖啡店不使用两阶段提交。IEEE Software,第 22 卷,第 2 期,第 64-66 页,2005 年 3 月。doi.1109/MS.2005.52

[80] Pat Helland.Life Beyond Distributed Transactions: An Apostate’s Opinion. At 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 2007.
[80] Pat Helland. 分布式事务之外的生活:一个异见者的观点。在第三届创新数据系统研究年会上,2007 年 1 月。

[81] Jonathan Oliver.My Beef with MSDTC and Two-Phase Commits. blog.jonathanoliver.com, April 2011. Archived at perma.cc/K8HF-Z4EN
[81] Jonathan Oliver. 我对 MSDTC 和两阶段提交的抱怨。blog.jonathanoliver.com,2011 年 4 月。存档于 perma.cc / K8HF - Z4EN

[82] Oren Eini (Ahende Rahien).The Fallacy of Distributed Transactions. ayende.com, July 2014. Archived at perma.cc/VB87-2JEF
[82] Oren Eini (Ahende Rahien). 分布式事务的错误。ayende.com,2014 年 7 月。存档于 perma.cc/VB87-2JEF

[83] Clemens Vasters.Transactions in Windows Azure (with Service Bus) – An Email Discussion. learn.microsoft.com, July 2012. Archived at perma.cc/4EZ9-5SKW
[83] Clemens Vasters. Windows Azure 中的事务(与服务总线)—— 电子邮件讨论. learn.microsoft.com, 2012 年 7 月. 永久存档于 perma.cc / 4EZ9 - 5SKW

[84] Ajmer Dhariwal.Orphaned MSDTC Transactions (-2 spids). eraofdata.com, December 2008. Archived at perma.cc/YG6F-U34C
[84] Ajmer Dhariwal. 孤立的 MSDTC 事务 (-2 spids). eraofdata.com, 2008 年 12 月. 永久存档于 perma.cc / YG6F - U34C

[85] Paul Randal.Real World Story of DBCC PAGE Saving the Day. sqlskills.com, June 2013. Archived at perma.cc/2MJN-A5QH
[85] Paul Randal. DBCC PAGE 的真实故事:拯救危机. sqlskills.com, 2013 年 6 月. 永久存档于 perma.cc / 2MJN - A5QH

[86] Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao.Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka. At ACM International Conference on Management of Data (SIGMOD), June 2021.doi.1145/3448016.3457556
[86] Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao. 一致性与完整性:重新思考 Apache Kafka 中的分布式流处理. 在 ACM 国际数据管理会议 (SIGMOD), 2021 年 6 月. doi.1145/3448016.3457556