Chapter 10. Consistency and Consensus 第 10 章. 一致性和共识#
An ancient adage warns, “Never go to sea with two chronometers; take one or three.”
古老的谚语告诫:“不要同时携带两个计时器出海;带一个或三个。”Frederick P. Brooks Jr., The Mythical Man-Month: Essays on Software Engineering (1995)
弗雷德里克・P・布鲁克斯, Jr.,《人月神话:软件工程论文》(1995)
Lots of things can go wrong in distributed systems, as discussed in Chapter 9. If we want a service to continue working correctly despite those things going wrong, we need to find ways of tolerating faults.
分布式系统可能会出现各种问题,正如第 9 章所述。如果我们希望服务在这些问题发生时仍能正常运行,我们就需要找到容忍错误的方法。
One of the best tools we have for fault tolerance is replication. However, as we saw in Chapter 6, having multiple copies of the data on multiple replicas opens up the risk of inconsistencies. Reads might be handled by a replica that is not up-to-date, yielding stale results. If multiple replicas can accept writes, we have to deal with conflicts between values that were concurrently written on different replicas. At a high level, there are two competing philosophies for dealing with such issues:
我们拥有的容错工具中,复制是最有效的一种。然而,正如我们在第 6 章所见,在多个副本上保留数据的多个副本会带来不一致的风险。读取操作可能会由一个未更新的副本处理,从而产生过时的结果。如果多个副本可以接受写入,我们就必须处理在不同副本上并发写入的值之间的冲突。从高层次来看,处理此类问题的两种主要哲学是:
Eventual consistency 最终一致性
In this philosophy, the fact that a system is replicated is made visible to the application, and you as application developer are expected to deal with the inconsistencies and conflicts that may arise. This approach is often used in systems with multi-leader (see “Multi-Leader Replication”) and leaderless replication (see “Leaderless Replication”).
在这种哲学中,系统复制的实际情况对应用程序可见,而作为应用程序开发者,你被期望处理可能产生的不一致性和冲突。这种方法通常用于多领导者(参见 “多领导者复制”)和无领导者复制(参见 “无领导者复制”)的系统。
Strong consistency 强一致性
This philosophy says that applications should not have to worry about internal details of replication, and that the system should behave as if it was single-node. The advantage of this approach is that it’s simpler for you, the application developer. The disadvantage is that stronger consistency has a performance cost, and some kinds of fault that an eventually consistent system can tolerate cause outages in strongly consistent systems.
这种理念认为,应用程序不应该担心复制的内部细节,系统应该表现得就像它是单节点的一样。这种方法的优点是,对你这个应用程序开发者来说更简单。缺点是,更强的致性有性能成本,而最终一致性系统可以容忍的一些故障,在强一致性系统中会导致停机。
As always, which approach is better depends on your application. If you have an app where users can make changes to data while offline, then eventual consistency is inevitable, as discussed in “Sync Engines and Local-First Software”. However, eventual consistency can also be difficult for applications to deal with. If your replicas are located in datacenters with fast, reliable communication, then strong consistency is often appropriate because its cost is acceptable.
始终如此,哪种方法更好取决于你的应用。如果你有一个用户可以在离线状态下修改数据的程序,那么最终一致性是不可避免的,正如在 “同步引擎和本地优先软件” 中讨论的那样。然而,最终一致性也可能对应用程序来说难以处理。如果你的副本位于通信快速可靠的数据中心,那么强一致性通常是合适的,因为它的成本是可以接受的。
In this chapter we will dive deeper into the strongly consistent approach, looking at three areas:
在本章中,我们将深入探讨强一致性方法,关注三个领域:
- One challenge is that “strong consistency” is quite vague, so we will develop a more precise definition of what we want to achieve: linearizability.
一个挑战在于 “强一致性” 相当模糊,因此我们将发展一个更精确的定义,即我们想要达成的目标:线性化。 - We will look at the problem of generating IDs and timestamps. This may sound unrelated to consistency but is actually closely connected.
我们将研究生成 ID 和时间戳的问题。这听起来似乎与一致性无关,但实际上是密切相关的。 - We will explore how distributed systems can achieve linearizability while still remaining fault-tolerant; the answer is consensus algorithms.
我们将探讨分布式系统如何在保持容错性的同时实现线性化;答案是共识算法。
Along the way, we will see that there are some fundamental limits on what is possible and what is not in a distributed system.
在这个过程中,我们将看到分布式系统中存在一些基本的可能性限制和不可能性。
The topics of this chapter are notorious for being hard to implement correctly; it’s very easy to build systems that behave fine when there are no faults, but which completely fall apart when faced with an unlucky combination of faults that the designer of the system hadn’t considered. A lot of theory has been developed to help us think through those edge cases, which enables us to build systems that can robustly tolerate faults.
本章的主题以难以正确实现而闻名;当没有故障时,构建的系统表现良好,但当面临系统设计者未考虑到的不幸故障组合时,系统会完全崩溃。为了帮助我们思考这些边缘情况,已经发展了很多理论,这使我们能够构建能够稳健地容忍故障的系统。
This chapter will only scratch the surface: we will stick with informal intuitions, and avoid the algorithmic nitty-gritty, formal models, and proofs. If you want to do serious work on consensus systems and similar infrastructure, you will need to go much deeper into the theory if you want any chance of your systems being robust. As usual, the literature references in this chapter provide some initial pointers.
本章仅触及表面:我们将坚持非正式直觉,避免算法细节、形式模型和证明。如果你想在共识系统及类似基础设施上做严肃工作,若想系统具备鲁棒性,你需要深入理论。一如既往,本章的文献参考提供了一些初步指引。
Linearizability 线性化一致性#
If you want a replicated database to be as simple as possible to use, you should make it behave as if it wasn’t replicated at all. Then users don’t have to worry about replication lag, conflicts, and other inconsistencies. That would give us the advantage of fault tolerance, but without the complexity arising from having to think about multiple replicas.
如果你希望复制的数据库尽可能简单易用,你应该让它表现得好像根本没有复制。这样用户就不必担心复制延迟、冲突和其他不一致性。这将给我们带来容错性的优势,但避免了因考虑多个副本而产生的复杂性。
This is the idea behind linearizability [1] (also known as atomic consistency [2],strong consistency, immediate consistency, or external consistency [3]). The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.
这是线性化 [1] 的核心思想(也称为原子一致性 [2]、强一致性、即时一致性或外部一致性 [3])。线性化的精确定义相当微妙,我们将在本节剩余部分进行探讨。但基本思想是让系统看起来只有一份数据副本,并且所有操作都是原子的。有了这个保证,即使现实中存在多个副本,应用程序也不必担心它们。
In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. Maintaining the illusion of a single copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and doesn’t come from a stale cache or replica. In other words, linearizability is a recency guarantee. To clarify this idea, let’s look at an example of a system that is not linearizable.
在线性化系统中,一旦某个客户端成功完成一次写入,所有从数据库读取的客户端都必须能够看到刚刚写入的值。维持单份数据的幻觉意味着保证读取的值是最新的、最新的值,而不是来自过时的缓存或副本。换句话说,线性化是一种时效性保证。为了阐明这个概念,让我们看一个非线性化系统的例子。

Figure 10-1 shows an example of a nonlinearizable sports website [4]. Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits reload on his own phone, but his request goes to a database replica that is lagging, and so his phone shows that the game is still ongoing.
图 10-1 展示了一个非线性化的体育网站示例 [4]。Aaliyah 和 Bryce 坐在同一间房间里,都在用手机查看他们喜欢的球队比赛的结果。就在最终比分公布后,Aaliyah 刷新了页面,看到赢家被宣布,兴奋地告诉了 Bryce。Bryce 怀疑地在他的手机上点击了刷新,但他的请求发送到了一个延迟的数据库副本,因此他的手机显示比赛仍在进行中。
If Aaliyah and Bryce had hit reload at the same time, it would have been less surprising if they had gotten two different query results, because they wouldn’t know at exactly what time their respective requests were processed by the server. However, Bryce knows that he hit the reload button (initiated his query) after he heard Aaliyah exclaim the final score, and therefore he expects his query result to be at least as recent as Aaliyah’s. The fact that his query returned a stale result is a violation of linearizability.
如果 Aaliyah 和 Bryce 同时点击了刷新,他们得到两个不同的查询结果会不那么令人惊讶,因为他们不知道各自请求何时被服务器处理。然而,Bryce 知道他在听到 Aaliyah 喊出最终比分后才点击了刷新按钮(发起了他的查询),因此他期望他的查询结果至少和 Aaliyah 的一样新。他的查询返回了一个过时的结果,这违反了线性化性。
What Makes a System Linearizable? 什么让系统线性化?#
In order to understand linearizability better, let’s look at some more examples.Figure 10-2 shows three clients concurrently reading and writing the same object x in a linearizable database. In distributed systems theory, x is called a register —in practice, it could be one key in a key-value store, one row in a relational database, or one document in a document database, for example.
为了更好地理解线性化,让我们来看一些更多例子。图 10-2 展示了三个客户端在一个线性化数据库中并发读取和写入同一个对象 x。在分布式系统理论中,x 被称为寄存器 —— 在实践中,它可能是一个键值存储中的一个键,一个关系数据库中的一行,或一个文档数据库中的一个文档,例如。

For simplicity, Figure 10-2 shows only the requests from the clients’ point of view, not the internals of the database. Each bar is a request made by a client, where the start of a bar is the time when the request was sent, and the end of a bar is when the response was received by the client. Due to variable network delays, a client doesn’t know exactly when the database processed its request—it only knows that it must have happened sometime between the client sending the request and receiving the response.
为简化起见,图 10-2 仅展示了客户端的请求视角,而不是数据库的内部结构。每个条形代表一个客户端发起的请求,条形的开始是请求发送的时间,条形的结束是客户端收到响应的时间。由于网络延迟的变化,客户端不知道数据库何时处理了它的请求 —— 它只知道这个处理必须在客户端发送请求和收到响应之间的某个时间点发生。
In this example, the register has two types of operations:
在这个例子中,寄存器有两种类型的操作:
- read (x) ⇒ v means the client requested to read the value of register x, and the database returned the value v.
read (x) ⇒ v 表示客户端请求读取寄存器 x 的值,数据库返回了值 v。 - write (x, v) ⇒ r means the client requested to set the register x to value v, and the database returned response r (which could be ok or error).
write (x, v) ⇒ r 表示客户端请求将寄存器 x 设置为值 v,数据库返回响应 r(可能是成功或错误)。
In Figure 10-2, the value of x is initially 0, and client C performs a write request to set it to 1. While this is happening, clients A and B are repeatedly polling the database to read the latest value. What are the possible responses that A and B might get for their read requests?
在图 10-2 中,x 的初始值为 0,客户端 C 执行写请求将其设置为 1。在此期间,客户端 A 和 B 反复轮询数据库以读取最新值。A 和 B 的读请求可能得到哪些响应?
- The first read operation by client A completes before the write begins, so it must definitely return the old value 0.
客户端 A 的第一个读操作在写操作开始前完成,因此它必须返回旧值 0。 - The last read by client A begins after the write has completed, so it must definitely return the new value 1 if the database is linearizable, because the read must have been processed after the write.
客户端 A 的最后一个读操作在写操作完成后开始,如果数据库是可线性化的,它必须返回新值 1,因为读操作必须在写操作之后处理。 - Any read operations that overlap in time with the write operation might return either 0 or 1, because we don’t know whether or not the write has taken effect at the time when the read operation is processed. These operations are concurrent with the write.
任何与写操作时间重叠的读操作可能返回 0 或 1,因为我们不知道写操作在读操作处理时是否已经生效。这些操作与写操作并发进行。
However, that is not yet sufficient to fully describe linearizability: if reads that are concurrent with a write can return either the old or the new value, then readers could see a value flip back and forth between the old and the new value several times while a write is going on. That is not what we expect of a system that emulates a “single copy of the data.”
然而,这还不足以完全描述线性化性:如果与写入并发进行的读取可以返回旧值或新值,那么在写入进行时,读取者可能会看到值在旧值和新值之间来回切换多次。这不是我们期望从模拟 “数据单拷贝” 的系统中所看到的行为。
To make the system linearizable, we need to add another constraint, illustrated in Figure 10-3.
为了让系统满足线性化要求,我们需要增加另一个约束,如图 10-3 所示。

In a linearizable system we imagine that there must be some point in time (between the start and end of the write operation) at which the value of x atomically flips from 0 to 1. Thus, if one client’s read returns the new value 1, all subsequent reads must also return the new value, even if the write operation has not yet completed.
在线性化系统中,我们假设在写操作开始和结束之间必须存在某个时间点,此时 x 的值会原子性地从 0 变为 1。因此,如果某个客户端的读操作返回了新值 1,所有后续的读操作也必须返回新值,即使写操作尚未完成。
This timing dependency is illustrated with an arrow in Figure 10-3. Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read. Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is still ongoing. (It’s the same situation as with Aaliyah and Bryce in Figure 10-1: after Aaliyah has read the new value, Bryce also expects to read the new value.)
这种时间依赖关系在图 10-3 中用箭头表示。客户端 A 是第一个读取新值 1 的客户端。A 的读操作返回后不久,B 开始一个新的读操作。由于 B 的读操作严格发生在 A 的读操作之后,它也必须返回 1,尽管 C 的写操作仍在进行中。(这与图 10-1 中 Aaliyah 和 Bryce 的情况相同:Aaliyah 读取了新值后,Bryce 也期望读取新值。)
We can further refine this timing diagram to visualize each operation taking effect atomically at some point in time [5], like in the more complex example shown in Figure 10-4. In this example we add a third type of operation besides read and write:
我们可以进一步细化这个时序图,以可视化每个操作在某个时间点原子性地生效 [5],就像在图 10-4 中展示的更复杂的例子那样。在这个例子中,除了读和写之外,我们添加了第三种类型的操作:
- cas (x, v old, v new) ⇒ r means the client requested an atomic compare-and-set operation (see “Conditional writes (compare-and-set)”). If the current value of the register x equals v old, it should be atomically set to v new. If the value of x is different from v old, then the operation should leave the register unchanged and return an error. r is the database’s response (ok or error).
cas(x, v old, v new ) ⇒ r 表示客户端请求了一个原子比较并设置操作(参见 “条件写入(比较并设置)”。如果寄存器 x 的当前值等于 v old ,它应该原子性地设置为 v new 。如果 x 的值与 v old 不同,那么操作应该保持寄存器不变并返回错误。r 是数据库的响应(成功或错误)。
Each operation in Figure 10-4 is marked with a vertical line (inside the bar for each operation) at the time when we think the operation was executed. Those markers are joined up in a sequential order, and the result must be a valid sequence of reads and writes for a register (every read must return the value set by the most recent write).
图 10-4 中的每个操作都在我们认为操作执行的时间点用一条垂直线(在每个操作条形内)标记。这些标记按顺序连接起来,结果必须是一个有效的寄存器读和写序列(每个读操作都必须返回最近一次写入设置的值)。
The requirement of linearizability is that the lines joining up the operation markers always move forward in time (from left to right), never backward. This requirement ensures the recency guarantee we discussed earlier: once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again.
线性化要求连接操作标记的线始终按时间向前移动(从左到右),绝不能向后。这一要求确保了我们之前讨论的时效性保证:一旦写入或读取了新值,所有后续读取都会看到写入的值,直到它再次被覆盖。

There are a few interesting details to point out in Figure 10-4:
在图 10-4 中有几点值得注意:
- First client B sent a request to read x, then client D sent a request to set x to 0, and then client A sent a request to set x to 1. Nevertheless, the value returned to B’s read is 1 (the value written by A). This is okay: it means that the database first processed D’s write, then A’s write, and finally B’s read. Although this is not the order in which the requests were sent, it’s an acceptable order, because the three requests are concurrent. Perhaps B’s read request was slightly delayed in the network, so it only reached the database after the two writes.
首先,客户端 B 发送了一个读取 x 的请求,然后客户端 D 发送了一个将 x 设置为 0 的请求,接着客户端 A 发送了一个将 x 设置为 1 的请求。然而,返回给 B 的读取值是 1(A 写入的值)。这没问题:这意味着数据库首先处理了 D 的写入,然后是 A 的写入,最后是 B 的读取。尽管这不是请求发送的顺序,但它是一个可接受的顺序,因为这三个请求是并发的。也许 B 的读取请求在网络中稍有延迟,所以它到达数据库时,两个写入操作已经完成了。 - Client B’s read returned 1 before client A received its response from the database, saying that the write of the value 1 was successful. This is also okay: it just means the ok response from the database to client A was slightly delayed in the network.
客户端 B 的读操作在客户端 A 从数据库收到写值 1 成功的响应之前就返回了 1。这也完全可以接受:这意味着数据库向客户端 A 的确认响应在网络中稍有延迟。 - This model doesn’t assume any transaction isolation: another client may change a value at any time. For example, C first reads 1 and then reads 2, because the value was changed by B between the two reads. An atomic compare-and-set (cas) operation can be used to check the value hasn’t been concurrently changed by another client: B and C’s cas requests succeed, but D’s cas request fails (by the time the database processes it, the value of x is no longer 0).
这个模型不假设任何事务隔离性:另一个客户端可以在任何时间改变一个值。例如,C 先读 1,然后读 2,因为值在两次读操作之间被 B 改变了。可以使用原子比较并设置(cas)操作来检查值是否被其他客户端并发改变:B 和 C 的 cas 请求成功,但 D 的 cas 请求失败(当数据库处理它时,x 的值已经不是 0 了)。 - The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C’s cas write, which updates x from 2 to 4. In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah and Bryce in Figure 10-1.
客户端 B 的最终读操作(在阴影条中)不能线性化。该操作与 C 的 cas 写操作并发进行,后者将 x 的值从 2 更新为 4。如果没有其他请求,B 的读操作返回 2 也是可以接受的。然而,在 B 的读操作开始之前,客户端 A 已经读取了新的值 4,因此 B 不允许读取比 A 更旧的值。再次强调,这与图 10-1 中的 Aaliyah 和 Bryce 的情况相同。
That is the intuition behind linearizability; the formal definition [1] describes it more precisely. It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order [6,7].
这就是线性化可解性的直观理解;正式定义 [1] 对其描述更为精确。通过记录所有请求和响应的时间,并检查它们是否能够排列成一个有效的顺序,可以(尽管计算成本很高)测试系统的行为是否是线性化的 [6, 7]。
Just as there are various weak isolation levels for transactions besides serializability (see “Weak Isolation Levels”), there are also various weaker consistency models for replicated systems besides linearizability [8]. In fact, the read-after-write, monotonic reads, and consistent prefix reads properties we saw in “Problems with Replication Lag” are examples of such weaker consistency models. Linearizability guarantees all these weaker properties, and more. In this chapter we will focus on linearizability, which is the strongest consistency model in common use.
正如除了串行化之外还有各种弱隔离级别(参见 “弱隔离级别”),除了线性化之外,复制系统也有各种较弱的强一致性模型 [8]。事实上,我们在 “复制延迟问题” 中看到的读后写、单调读和一致前缀读属性就是这种较弱强一致性模型的例子。线性化保证了所有这些较弱的属性,以及更多。在本章中,我们将重点关注线性化,这是最常用的最强一致性模型。
Relying on Linearizability 依赖线性化#
In what circumstances is linearizability useful? Viewing the final score of a sporting match is perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any real harm in this situation. However, there a few areas in which linearizability is an important requirement for making a system work correctly.
在线性化有什么用?观看体育比赛的最终得分或许是一个轻率的例子:结果落后几秒钟不太可能在这种情况下造成任何实际伤害。然而,在线性化是确保系统能够正确运行的重要要求的一些领域。
Locking and leader election 锁定和领导选举#
A system that uses single-leader replication needs to ensure that there is indeed only one leader, not several (split brain). One way of electing a leader is to use a lease: every node that starts up tries to acquire the lease, and the one that succeeds becomes the leader [17]. No matter how this mechanism is implemented, it must be linearizable: it should not be possible for two different nodes to acquire the lease at the same time.
一个使用单领导复制的系统需要确保确实只有一个领导,而不是多个(脑裂)。一种选举领导的方法是使用租约:每个启动的节点尝试获取租约,成功者成为领导 [17]。无论这种机制如何实现,它必须是线性化的:不应该允许两个不同的节点同时获取租约。
Coordination services like Apache ZooKeeper [18] and etcd are often used to implement distributed leases and leader election. They use consensus algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms later in this chapter). There are still many subtle details to implementing leases and leader election correctly (see for example the fencing issue in “Distributed Locks and Leases”), and libraries like Apache Curator help by providing higher-level recipes on top of ZooKeeper. However, a linearizable storage service is the basic foundation for these coordination tasks.
协调服务如 Apache ZooKeeper [18] 和 etcd 常用于实现分布式租约和领导选举。它们使用共识算法以容错的方式实现线性化操作(我们将在本章后面讨论此类算法)。实现租约和领导选举时仍有许多微妙细节需要注意(例如 “分布式锁和租约” 中的围栏问题),而像 Apache Curator 这样的库通过在 ZooKeeper 之上提供更高级的配方来帮助实现这些功能。然而,线性化存储服务是这些协调任务的基础。
Note 注意#
Strictly speaking, ZooKeeper provides linearizable writes, but reads may be stale, since there is no guarantee that they are served from the current leader [18]. etcd since version 3 provides linearizable reads by default.
严格来说,ZooKeeper 提供线性化写入,但读取可能过时,因为没有保证它们是从当前领导者提供的 [18]。etcd 从 3 版本开始默认提供线性化读取。
Distributed locking is also used at a much more granular level in some distributed databases, such as Oracle Real Application Clusters (RAC) [19]. RAC uses a lock per disk page, with multiple nodes sharing access to the same disk storage system. Since these linearizable locks are on the critical path of transaction execution, RAC deployments usually have a dedicated cluster interconnect network for communication between database nodes.
分布式锁在一些分布式数据库中也用于更细粒度的级别,例如 Oracle Real Application Clusters (RAC) [19]。RAC 对每个磁盘页使用一个锁,多个节点共享对同一磁盘存储系统的访问。由于这些线性化锁位于事务执行的关键路径上,RAC 部署通常有一个专用的集群互连网络,用于数据库节点之间的通信。
Constraints and uniqueness guarantees 约束和唯一性保证#
Uniqueness constraints are common in databases: for example, a username or email address must uniquely identify one user, and in a file storage service there cannot be two files with the same path and filename. If you want to enforce this constraint as the data is written (such that if two people try to concurrently create a user or a file with the same name, one of them will be returned an error), you need linearizability.
唯一性约束在数据库中很常见:例如,用户名或电子邮件地址必须唯一地标识一个用户,在文件存储服务中不能有两个具有相同路径和文件名的文件。如果你希望在写入数据时强制执行此约束(即如果两个人尝试并发创建具有相同名称的用户或文件,其中一个人将返回错误),你需要线性化。
This situation is actually similar to a lock: when a user registers for your service, you can think of them acquiring a “lock” on their chosen username. The operation is also very similar to an atomic compare-and-set, setting the username to the ID of the user who claimed it, provided that the username is not already taken.
这种情况实际上类似于一个锁:当用户注册你的服务时,你可以认为他们获取了对其所选用户名的 “锁”。该操作也非常类似于一个原子比较并设置操作,将用户名设置为声称该用户名的用户的 ID,前提是该用户名尚未被占用。
Similar issues arise if you want to ensure that a bank account balance never goes negative, or that you don’t sell more items than you have in stock in the warehouse, or that two people don’t concurrently book the same seat on a flight or in a theater. These constraints all require there to be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all nodes agree on.
如果你想要确保银行账户余额永远不会为负,或者你不会出售比你仓库中拥有的更多商品,或者两个人不会同时预订同一航班或剧场的座位,也会出现类似的问题。这些约束都需要有一个所有节点都同意的单一最新值(账户余额、库存水平、座位占用情况)。
In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if a flight is overbooked, you can move customers to a different flight and offer them compensation for the inconvenience). In such cases, linearizability may not be needed, and we will discuss such loosely interpreted constraints in [Link to Come].
在实际应用中,有时可以宽松地处理这些约束(例如,如果航班超售,你可以将乘客转移到其他航班并为他们带来的不便提供补偿)。在这种情况下,线性化可能不是必需的,我们将讨论这种宽松解释的约束,见 [待链接]。
However, a hard uniqueness constraint, such as the one you typically find in relational databases, requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints, can be implemented without linearizability [20].
然而,严格的唯一性约束,如关系数据库中常见的约束,需要线性化。其他类型的约束,如外键或属性约束,可以在不满足线性化的情况下实现 [20]。
Cross-channel timing dependencies 跨通道时间依赖#
Notice a detail in Figure 10-1: if Aaliyah hadn’t exclaimed the score, Bryce wouldn’t have known that the result of his query was stale. He would have just refreshed the page again a few seconds later, and eventually seen the final score. The linearizability violation was only noticed because there was an additional communication channel in the system (Aaliyah’s voice to Bryce’s ears).
注意图 10-1 中的细节:如果 Aaliyah 没有喊出比分,Bryce 就不会知道他的查询结果已经过时。他会在几秒钟后再次刷新页面,最终看到最终比分。线性化违规之所以被发现,是因为系统中存在额外的通信通道(Aaliyah 的声音传到 Bryce 的耳朵里)。
Similar situations can arise in computer systems. For example, say you have a website where users can upload a video, and a background process transcodes the video to a lower quality that can be streamed on slow internet connections. The architecture and dataflow of this system is illustrated in Figure 10-5.
类似的情况也可能出现在计算机系统中。例如,假设你有一个网站,用户可以上传视频,后台进程将视频转码为较低质量,以便在慢速互联网连接上流式传输。该系统的架构和数据流如图 10-5 所示。
The video transcoder needs to be explicitly instructed to perform a transcoding job, and this instruction is sent from the web server to the transcoder via a message queue (see Chapter 12). The web server doesn’t place the entire video on the queue, since most message brokers are designed for small messages, and a video may be many megabytes in size. Instead, the video is first written to a file storage service, and once the write is complete, the instruction to the transcoder is placed on the queue.
视频转码器需要被明确指示执行转码任务,该指示通过消息队列从 Web 服务器发送给转码器(参见第 12 章)。Web 服务器不会将整个视频放入队列,因为大多数消息代理是为小消息设计的,而视频可能包含数兆字节的数据。相反,视频首先被写入文件存储服务,一旦写入完成,转码器的指令就会被放入队列。

If the file storage service is linearizable, then this system should work fine. If it is not linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in Figure 10-5) might be faster than the internal replication inside the storage service. In this case, when the transcoder fetches the original video (step 5), it might see an old version of the file, or nothing at all. If it processes an old version of the video, the original and transcoded videos in the file storage become permanently inconsistent with each other.
如果文件存储服务是可线性化的,那么这个系统应该能正常工作。如果不可线性化,则存在竞态条件风险:消息队列(图 10-5 中的步骤 3 和 4)可能比存储服务内部的内部复制更快。在这种情况下,当转码器获取原始视频(步骤 5)时,它可能会看到一个旧版本的文件,或者什么都没有。如果它处理了一个旧版本的视频,文件存储中的原始视频和转码视频将永久性地变得彼此不一致。
This problem arises because there are two different communication channels between the web server and the transcoder: the file storage and the message queue. Without the recency guarantee of linearizability, race conditions between these two channels are possible. This situation is analogous to Figure 10-1, where there was also a race condition between two communication channels: the database replication and the real-life audio channel between Aaliyah’s mouth and Bryce’s ears.
这个问题产生的原因是,在 Web 服务器和转码器之间存在两个不同的通信通道:文件存储和消息队列。由于缺乏线性化可保证的新鲜性,这两个通道之间可能发生竞态条件。这种情况类似于图 10-1,其中也存在两个通信通道之间的竞态条件:数据库复制和 Aaliyah 的嘴与 Bryce 的耳朵之间的现实音频通道。
A similar race condition occurs if you have a mobile app that can receive push notifications, and the app fetches some data from a server when it receives a push notification. If the data fetch might go to a lagging replica, it could happen that the push notification goes through quickly, but the subsequent fetch doesn’t see the data that the push notification was about.
如果你有一个可以接收推送通知的手机应用,并且当收到推送通知时应用会从服务器获取一些数据,那么可能会出现类似的竞态条件。如果数据获取可能会连接到一个响应迟缓的副本,那么推送通知可能会快速通过,但后续获取操作却看不到推送通知所涉及的数据。
Linearizability is not the only way of avoiding this race condition, but it’s the simplest to understand. If you control the additional communication channel (like in the case of the message queue, but not in the case of Aaliyah and Bryce), you can use alternative approaches similar to what we discussed in “Reading Your Own Writes”, at the cost of additional complexity.
线性化不是避免这种竞态条件的唯一方法,但它是最容易理解的方法。如果你控制了额外的通信通道(例如在消息队列的情况下,但在 Aaliyah 和 Bryce 的情况下不是这样),你可以使用类似于我们在 “读取自己的写入” 中讨论的替代方法,但需要付出额外的复杂性的代价。
Implementing Linearizable Systems 实现线性化系统#
Now that we’ve looked at a few examples in which linearizability is useful, let’s think about how we might implement a system that offers linearizable semantics.
既然我们已经看到了几个线性化有用的例子,那么让我们思考一下我们如何实现一个提供线性化语义的系统。
Since linearizability essentially means “behave as though there is only a single copy of the data, and all operations on it are atomic,” the simplest answer would be to really only use a single copy of the data. However, that approach would not be able to tolerate faults: if the node holding that one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.
由于线性化一致性本质上意味着 “表现得好像只有一份数据副本,并且所有操作都是原子的”,最简单的答案似乎是确实只使用一份数据副本。然而,这种方法无法容忍故障:如果持有这份唯一副本的节点发生故障,数据将会丢失,或者至少在节点重新启动之前无法访问。
Let’s revisit the replication methods from Chapter 6, and compare whether they can be made linearizable:
让我们重新审视第 6 章中的复制方法,并比较它们是否可以被线性化:
Single-leader replication (potentially linearizable)
单领导者复制(可能线性化)
In a system with single-leader replication, the leader has the primary copy of the data that is used for writes, and the followers maintain backup copies of the data on other nodes. As long as you perform all reads and writes on the leader, they are likely to be linearizable. However, this assumes that you know for sure who the leader is. As discussed in “Distributed Locks and Leases”, it is quite possible for a node to think that it is the leader, when in fact it is not—and if the delusional leader continues to serve requests, it is likely to violate linearizability [21]. With asynchronous replication, failover may even lose committed writes, which violates both durability and linearizability.
在一个单领导者复制系统中,领导者拥有数据的主副本,用于写入操作,而跟随者在其他节点上维护数据的备份副本。只要你在领导者上执行所有读和写操作,它们很可能是可线性化的。然而,这假设你确切地知道谁是领导者。正如在 “分布式锁和租约” 中讨论的,一个节点可能会认为自己是领导者,但实际上并不是 —— 如果这个幻想中的领导者继续处理请求,很可能会违反线性化性 [21]。在异步复制中,故障转移甚至可能会丢失已提交的写入,这会违反持久性和线性化性。
Sharding a single-leader database, with a separate leader per shard, does not affect linearizability, since it is only a single-object guarantee. Cross-shard transactions are a different matter (see “Distributed Transactions”).
将单领导者数据库分片,每个分片有单独的领导者,不会影响线性化性,因为它只是一个单对象保证。跨分片事务是另一回事(参见 “分布式事务”)。
Consensus algorithms (likely linearizable)
共识算法(很可能是可线性化的)
Some consensus algorithms are essentially single-leader replication with automatic leader election and failover. They are carefully designed to prevent split brain, allowing them to implement linearizable storage safely. ZooKeeper uses the Zab consensus algorithm [22] and etcd uses Raft [23], for example. However, just because a system uses consensus does not guarantee that all operations on it are linearizable: if it allows reads on a node without checking that it is still the leader, the results of the read may be stale if a new leader has just been elected.
有些共识算法本质上是一种单领导者复制,具有自动领导者选举和故障转移功能。它们经过精心设计以防止脑裂,从而能够安全地实现线性化存储。例如,ZooKeeper 使用 Zab 共识算法 [22],etcd 使用 Raft [23]。然而,仅仅因为系统使用共识并不意味着其上的所有操作都是线性化的:如果它允许在没有检查是否仍然是领导者的情况下读取某个节点,而此时刚刚选出了新的领导者,那么读取的结果可能会过时。
Multi-leader replication (not linearizable)
多领导者复制(非线性化)
Systems with multi-leader replication are generally not linearizable, because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes. For this reason, they can produce conflicting writes that require resolution (see “Dealing with Conflicting Writes”).
具有多领导者复制的系统通常不是线性化的,因为它们同时处理多个节点的写入操作,并异步地将这些写入复制到其他节点。因此,它们可能会产生需要解决的冲突写入(参见 “处理冲突写入”)。
Leaderless replication (probably not linearizable)
无领导者复制(可能非线性化)
For systems with leaderless replication (Dynamo-style; see “Leaderless Replication”), people sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes (w + r > n). Depending on the exact algorithm, and depending on how you define strong consistency, this is not quite true.
对于领导者复制(Dynamo 风格;参见 “领导者复制”)的系统,人们有时声称可以通过要求多数派读取和写入(w + r > n)来获得 “强一致性”。这取决于具体的算法,以及你对强一致性的定义,这并不完全正确。
“Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra and ScyllaDB) are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew (see “Relying on Synchronized Clocks”). Even with quorums, nonlinearizable behavior is possible, as demonstrated in the next section.
基于每日时钟的 “最后写入者胜出” 冲突解决方法(例如在 Cassandra 和 ScyllaDB 中)几乎肯定是非线性化的,因为时钟时间戳无法保证与实际事件顺序一致,这是由于时钟偏移(参见 “依赖同步时钟”)。即使有仲裁,也可能出现非线性化行为,下一节将进行说明。
Linearizability and quorums 线性化与门限#
Intuitively, it seems as though quorum reads and writes should be linearizable in a Dynamo-style model. However, when we have variable network delays, it is possible to have race conditions, as demonstrated in Figure 10-6.
直观上,似乎在 Dynamo 风格的模型中,门限读取和写入应该是线性化的。然而,当存在可变网络延迟时,可能会出现竞争条件,如图 10-6 所示。

In Figure 10-6, the initial value of x is 0, and a writer client is updating x to 1 by sending the write to all three replicas (n = 3, w = 3). Concurrently, client A reads from a quorum of two nodes (r = 2) and sees the new value 1 on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two nodes, and gets back the old value 0 from both.
在图 10-6 中,x 的初始值为 0,一个写入客户端通过向所有三个副本发送写入请求来更新 x 为 1(n = 3,w = 3)。同时,客户端 A 从包含两个节点的门限读取(r = 2),并在其中一个节点上看到新值 1。同样在写入操作的同时,客户端 B 从另一个包含两个节点的门限读取,并从两个副本上都得到了旧值 0。
The quorum condition is met (w + r > n), but this execution is nevertheless not linearizable: B’s request begins after A’s request completes, but B returns the old value while A returns the new value. (It’s once again the Aaliyah and Bryce situation from Figure 10-1.)
仲裁条件满足(w + r > n),但这次执行仍然无法线性化:B 的请求开始于 A 的请求完成之后,但 B 返回旧值时 A 返回新值。(这又是图 10-1 中的 Aaliyah 和 Bryce 的情况。)
It is possible to make Dynamo-style quorums linearizable at the cost of reduced performance: a reader must perform read repair (see “Catching up on missed writes”) synchronously, before returning results to the application [24]. Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the latest timestamp of any prior write, and ensure that the new write has a greater timestamp [25,26]. However, Riak does not perform synchronous read repair due to the performance penalty. Cassandra does wait for read repair to complete on quorum reads [27], but it loses linearizability due to its use of time-of-day clocks for timestamps.
可以通过降低性能的方式使 Dynamo 风格的仲裁线性化:读者必须执行读修复(参见 “追赶上错过的写入”),在返回结果给应用程序之前同步进行 [24]。此外,在写入之前,写入者必须读取一组节点的最新状态,以获取任何先前写入的最新时间戳,并确保新的写入具有更大的时间戳 [25, 26]。然而,Riak 由于性能损耗而不执行同步读修复。Cassandra 在仲裁读取时会等待读修复完成 [27],但由于它使用日时钟来记录时间戳,因此失去了线性化能力。
Moreover, only linearizable read and write operations can be implemented in this way; a linearizable compare-and-set operation cannot, because it requires a consensus algorithm [28].
此外,只有线性化的读和写操作才能以这种方式实现;线性化的比较并设置操作不能,因为它需要共识算法 [28]。
In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability, even with quorum reads and writes.
总之,即使使用 Dynamo 风格的复制和仲裁读取 / 写入,无主系统也不提供线性化,这是最安全的假设。
The Cost of Linearizability 线性化的代价#
As some replication methods can provide linearizability and others cannot, it is interesting to explore the pros and cons of linearizability in more depth.
由于某些复制方法可以提供线性化而另一些则不能,因此深入探讨线性化的利弊很有趣。
We already discussed some use cases for different replication methods in Chapter 6; for example, we saw that multi-leader replication is often a good choice for multi-region replication (see “Geographically Distributed Operation”). An example of such a deployment is illustrated in Figure 10-7.
我们在第 6 章已经讨论了不同复制方法的一些用例;例如,我们看到多主复制通常适用于多区域复制(参见 “地理分布式操作”)。这种部署的一个示例在图 10-7 中进行了说明。

Consider what happens if there is a network interruption between the two regions. Let’s assume that the network within each region is working, and clients can reach their local region, but the regions cannot connect to each other. This is known as a network partition.
考虑如果两个区域之间存在网络中断会发生什么。假设每个区域内的网络都在运行,客户端可以连接到本地区域,但区域之间无法互相连接。这种情况被称为网络分区。
With a multi-leader database, each region can continue operating normally: since writes from one region are asynchronously replicated to the other, the writes are simply queued up and exchanged when network connectivity is restored.
在多主数据库中,每个区域可以继续正常运行:由于一个区域的写入会异步复制到其他区域,因此当网络连接恢复时,这些写入会简单地排队并交换。
On the other hand, if single-leader replication is used, then the leader must be in one of the regions. Any writes and any linearizable reads must be sent to the leader—thus, for any clients connected to a follower region, those read and write requests must be sent synchronously over the network to the leader region.
另一方面,如果使用单主复制,则领导者必须位于其中一个区域。任何写入和任何可线性化读取都必须发送到领导者 —— 因此,对于连接到跟随者区域的任何客户端,这些读取和写入请求必须通过网络同步发送到领导者区域。
If the network between regions is interrupted in a single-leader setup, clients connected to follower regions cannot contact the leader, so they cannot make any writes to the database, nor any linearizable reads. They can still make reads from the follower, but they might be stale (nonlinearizable). If the application requires linearizable reads and writes, the network interruption causes the application to become unavailable in the regions that cannot contact the leader.
在单主设置中,如果区域之间的网络中断,连接到跟随者区域的客户端无法联系到领导者,因此他们无法向数据库进行任何写入,也无法进行任何可线性化读取。他们仍然可以从跟随者读取,但可能会过时(不可线性化)。如果应用程序需要可线性化读取和写入,网络中断会导致无法联系到领导者的区域的应用程序不可用。
If clients can connect directly to the leader region, this is not a problem, since the application continues to work normally there. But clients that can only reach a follower region will experience an outage until the network link is repaired.
如果客户端可以直接连接到领导者区域,这不是问题,因为应用程序在那里可以正常运行。但只能连接到追随者区域的客户端将经历服务中断,直到网络连接修复。
The CAP theorem CAP 定理#
This issue is not just a consequence of single-leader and multi-leader replication: any linearizable database has this problem, no matter how it is implemented. The issue also isn’t specific to multi-region deployments, but can occur on any unreliable network, even within one region. The trade-off is as follows:
这个问题不仅仅是单领导者或多领导者复制的结果:任何可线性化的数据库都有这个问题,无论其如何实现。这个问题也并非特定于多区域部署,任何不可靠的网络都可能发生,即使在同一区域内。权衡如下:
- If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then some replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed, or return an error (either way, they become unavailable). This choice is sometimes known as CP (consistent under network partitions).
如果您的应用程序需要线性化,并且由于网络问题,某些副本与其他副本断开连接,那么在断开连接期间,某些副本无法处理请求:它们必须等待网络问题修复,或者返回错误(无论哪种方式,它们都变得不可用)。这种选择有时被称为 CP(网络分区下的一致性)。 - If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (e.g., multi-leader). In this case, the application can remain available in the face of a network problem, but its behavior is not linearizable. This choice is known as AP (available under network partitions).
如果你的应用程序不需要线性化,那么它可以通过一种方式编写,即每个副本可以独立处理请求,即使它与其它副本断开连接(例如,多领导者)。在这种情况下,应用程序在网络问题面前可以保持可用,但其行为无法线性化。这种选择被称为 AP(网络分区下的可用性)。
Thus, applications that don’t require linearizability can be more tolerant of network problems. This insight is popularly known as the CAP theorem [29,30,31,32], named by Eric Brewer in 2000, although the trade-off had been known to designers of distributed databases since the 1970s [33,34,35].
因此,不需要线性化的应用程序对网络问题具有更强的容忍度。这一见解被广泛称为 CAP 定理 [29, 30, 31, 32],由 Eric Brewer 于 2000 年命名,尽管自 1970 年代以来分布式数据库的设计师们就已经知道这种权衡 [33, 34, 35]。
CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of starting a discussion about trade-offs in databases. At the time, many distributed databases focused on providing linearizable semantics on a cluster of machines with shared storage [19], and CAP encouraged database engineers to explore a wider design space of distributed shared-nothing systems, which were more suitable for implementing large-scale web services [36]. CAP deserves credit for this culture shift—it helped trigger the NoSQL movement, a burst of new database technologies around the mid-2000s.
CAP 最初被提出时只是一个经验法则,没有精确的定义,其目的是引发关于数据库权衡的讨论。当时,许多分布式数据库专注于在共享存储的机器集群上提供线性化语义 [19],而 CAP 鼓励数据库工程师探索更适合实现大规模网络服务的分布式无共享系统设计空间 [36]。CAP 为这种文化转变应得赞誉 —— 它帮助触发了 NoSQL 运动,即 2000 年代中期涌现的一批新型数据库技术。
The CAP theorem as formally defined [30] is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions, which according to data from Google are the cause of less than 8% of incidents [41]). It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems [4,38].
根据正式定义 [30],CAP 定理的范围非常狭窄:它只考虑一种一致性模型(即可线性化)和一种故障类型(即网络分区,根据谷歌的数据,这导致的事故不到 8%[41])。它没有提及网络延迟、死节点或其他权衡。因此,尽管 CAP 在历史上具有重要影响,但它对系统设计几乎没有实际价值 [4, 38]。
There have been efforts to generalize CAP. For example, the PACELC principle observes that system designers might also choose to weaken consistency at times when the network is working fine in order to reduce latency [39,40,42]. Thus, during a network partition (P), we need to choose between availability (A) and consistency (C); else (E), when there is no partition, we may choose between low latency (L) and consistency (C). However, this definition inherits several problems with CAP, such as the counterintuitive definitions of consistency and availability.
人们曾尝试对 CAP 进行泛化。例如,PACELC 原则观察到,系统设计者有时也可能选择在网络运行良好时减弱一致性,以减少延迟 [39, 40, 42]。因此,在网络分区(P)期间,我们需要在可用性(A)和一致性(C)之间进行选择;否则(E),当没有分区时,我们可以在低延迟(L)和一致性(C)之间进行选择。然而,这种定义继承了 CAP 的几个问题,例如一致性和可用性的反直觉定义。
There are many more interesting impossibility results in distributed systems [43], and CAP has now been superseded by more precise results [44,45], so it is of mostly historical interest today.
分布式系统中还有许多更有趣的不可能性结果 [43],而 CAP 现在已被更精确的结果所取代 [44,45],因此它今天基本上只有历史意义。
Linearizability and network delays 线性化与网络延迟#
Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable in practice. For example, even RAM on a modern multi-core CPU is not linearizable [46]: if a thread running on one CPU core writes to a memory address, and a thread on another CPU core reads the same address shortly afterward, it is not guaranteed to read the value written by the first thread (unless a memory barrier or fence [47] is used).
尽管线性化性是一个有用的保证,但令人惊讶的是,实际上很少有系统是线性可化的。例如,即使现代多核 CPU 上的 RAM 也不是线性可化的 [46]:如果一个线程在一个 CPU 核心上写入一个内存地址,另一个 CPU 核心上的线程稍后读取同一个地址,并不能保证读取到第一个线程写入的值(除非使用内存屏障或栅栏 [47])。
The reason for this behavior is that every CPU core has its own memory cache and store buffer. Memory access first goes to the cache by default, and any changes are asynchronously written out to main memory. Since accessing data in the cache is much faster than going to main memory [48], this feature is essential for good performance on modern CPUs. However, there are now several copies of the data (one in main memory, and perhaps several more in various caches), and these copies are asynchronously updated, so linearizability is lost.
这种行为的原因是每个 CPU 核心都有自己的内存缓存和存储缓冲区。内存访问默认首先访问缓存,并且任何更改都会异步写入主内存。由于在缓存中访问数据比访问主内存快得多 [48],这一特性对于现代 CPU 的良好性能至关重要。然而,现在数据有多个副本(一个在主内存中,可能还有几个在不同的缓存中),并且这些副本异步更新,因此线性化性会丢失。
Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory consistency model: within one computer we usually assume reliable communication, and we don’t expect one CPU core to be able to continue operating normally if it is disconnected from the rest of the computer. The reason for dropping linearizability is performance, not fault tolerance [39].
为什么要做这种权衡?用 CAP 定理来为多核内存一致性模型辩护毫无意义:在一台计算机内部,我们通常假设通信可靠,而且我们不会期望一个 CPU 核心在断开与其他部分的连接后还能正常工作。放弃线性化性的原因是为了性能,而不是容错性 [39]。
The same is true of many distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not so much for fault tolerance [42]. Linearizability is slow—and this is true all the time, not only during a network fault.
许多选择不提供线性化保证的分布式数据库也是如此:它们这样做主要是为了提高性能,而不是为了容错性 [42]。线性化性很慢 —— 而且这一点始终成立,不仅仅是在网络故障期间。
Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is no: Attiya and Welch [49] prove that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays, like most computer networks (see “Timeouts and Unbounded Delays”), the response time of linearizable reads and writes is inevitably going to be high. A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems. In [Link to Come] we will discuss some approaches for avoiding linearizability without sacrificing correctness.
我们难道不能找到一种更高效的线性化存储实现方式吗?答案似乎是否定的:Attiya 和 Welch [49] 证明,如果你想要线性化,读和写请求的响应时间至少要和网络延迟的不确定性成正比。在延迟变化很大的网络中,比如大多数计算机网络(参见 “超时和无界延迟”),线性化读和写的响应时间不可避免地会很高。虽然不存在更快的线性化算法,但更弱的并发性模型可以快得多,因此这种权衡对延迟敏感的系统非常重要。在 [待链接] 中,我们将讨论一些避免线性化而不牺牲正确性的方法。
ID Generators and Logical ClocksID 生成器和逻辑时钟#
In many applications you need to assign some sort of unique ID to database records when they are created, which gives you a primary key by which you can refer to those records. In single-node databases it is common to use an auto-incrementing integer, which has the advantage that it can be stored in only 64 bits (or even 32 bits if you are sure that you will never have more than 4 billion records, but that is risky).
在许多应用中,当数据库记录创建时,你需要为其分配某种唯一 ID,这会给你一个可以通过它来引用这些记录的主键。在单节点数据库中,通常使用自增整数,它的优点是只需 64 位(或者如果你确定永远不会超过 40 亿条记录,甚至可以用 32 位,但这有风险)即可存储。
Another advantage of such auto-incrementing IDs is that the order of the IDs tells you the order in which the records were created. For example, Figure 10-8 shows a chat application that assigns auto-incrementing IDs to chat messages as they are posted. You can then display the messages in order of increasing ID, and the resulting chat threads will make sense: Aaliyah posts a question that is assigned ID 1, and Bryce’s answer to the question is assigned a greater ID, namely 3.
这种自增 ID 的另一个优点是,ID 的顺序会告诉你记录的创建顺序。例如,图 10-8 展示了一个聊天应用,该应用在发布聊天消息时为其分配自增 ID。然后你可以按 ID 递增的顺序显示消息,这样生成的聊天线程就会合乎逻辑:Aaliyah 发布了一个被分配 ID 1 的问题,而 Bryce 对该问题的回答被分配了更大的 ID,即 3。

This single-node ID generator is another example of a linearizable system. Each request to fetch the ID is an operation that atomically increments a counter and returns the old counter value (a fetch-and-add operation); linearizability ensures that if the posting of Aaliyah’s message completes before Bryce’s posting begins, then Bryce’s ID must be greater than Aaliyah’s. The messages by Aaliyah and Caleb in Figure 10-8 are concurrent, so linearizability doesn’t specify how their IDs must be ordered, as long as they are unique.
这个单节点 ID 生成器是线性化系统的一个例子。每次获取 ID 的请求都是一个原子性递增计数器并返回旧计数器值的操作(一个获取并添加操作);线性化确保如果 Aaliyah 的消息发布在 Bryce 发布之前完成,那么 Bryce 的 ID 必须大于 Aaliyah 的。图 10-8 中 Aaliyah 和 Caleb 的消息是并发的,因此线性化并不指定它们的 ID 必须如何排序,只要它们是唯一的。
An in-memory single-node ID generator is easy to implement: you can use the atomic increment instruction provided by your CPU, which allows multiple threads to safely increment the same counter. It’s a bit more effort to make the counter persistent, so that the node can crash and restart without resetting the counter value, which would result in duplicate IDs. But the real problems are:
内存中单节点 ID 生成器易于实现:你可以使用 CPU 提供的原子递增指令,这允许多个线程安全地递增同一个计数器。让计数器持久化需要更多的工作,以便节点可以崩溃重启而不重置计数器值,否则会导致 ID 重复。但真正的问题是:
- A single-node ID generator is not fault-tolerant because that node is a single point of failure.
单节点 ID 生成器不具有容错性,因为该节点是一个单点故障。 - It’s slow if you want to create a record in another region, as you potentially have to make a round-trip to the other side of the planet just to get an ID.
如果你想在另一个区域创建记录,它会很慢,因为你可能需要往返地球另一端才能获取一个 ID。 - That single node could become a bottleneck if you have high write throughput.
如果写入吞吐量很高,这个单节点可能会成为瓶颈。
There are various alternative options for ID generators that you can consider:
对于 ID 生成器,你可以考虑以下几种替代方案:
Sharded ID assignment 分片 ID 分配
You could have multiple nodes that assign IDs—for example, one that generates only even numbers, and one that generates only odd numbers. In general, you can reserve some bits in the ID to contain a shard number. Those IDs are still compact, but you lose the ordering property: for example, if you have chat messages with IDs 16 and 17, you don’t know whether message 16 was actually sent first, because the IDs were assigned by different nodes, and one node might have been ahead of the other.
你可以让多个节点分配 ID,例如,一个节点只生成偶数 ID,另一个节点只生成奇数 ID。通常,你可以在 ID 中预留一些位来包含分片编号。这些 ID 仍然紧凑,但你会失去顺序属性:例如,如果你有 ID 为 16 和 17 的聊天消息,你不知道消息 16 是否实际上先发送的,因为 ID 是由不同的节点分配的,其中一个节点可能领先于另一个节点。
Preallocated blocks of IDs
预分配的 ID 块
Instead of requesting individual IDs from the single-node ID generator, it could hand out blocks of IDs. For example, node A might claim the block of IDs from 1 to 1,000, and node B might claim the block from 1,001 to 2,000. Then each node can independently hand out IDs from its block, and request a new block from the single-node ID generator when its supply of sequence numbers begins to run low. However, this scheme doesn’t ensure correct ordering either: it could happen that one message is given an ID in the range from 1,001 to 2,000, and a later message is given an ID in the range from 1 to 1,000 if the ID was assigned by a different node.
与其从单节点 ID 生成器请求单个 ID,它可以直接分配 ID 块。例如,节点 A 可能声明从 1 到 1000 的 ID 块,而节点 B 可能声明从 1001 到 2000 的 ID 块。然后每个节点可以独立地从其 ID 块中分配 ID,并在其序列号供应开始不足时向单节点 ID 生成器请求新的 ID 块。然而,这种方案也不能保证正确的顺序:如果 ID 是由不同的节点分配的,可能会发生一个消息被分配了 1001 到 2000 范围内的 ID,而一个稍后的消息被分配了 1 到 1000 范围内的 ID 的情况。
Random UUIDs 随机 UUID
You can use universally unique identifiers (UUIDs), also known as globally unique identifiers (GUIDs). These have the big advantage that they can be generated locally on any node without requiring communication, but they require more space (128 bits). There are several different versions of UUIDs; the simplest is version 4, which is essentially a random number that is so long that is very unlikely that two nodes would ever pick the same one. Unfortunately, the order of such IDs is also random, so comparing two IDs tells you nothing about which one is newer.
您可以使用通用唯一标识符(UUID),也称为全局唯一标识符(GUID)。这些具有一个很大的优点,即它们可以在任何节点上本地生成,而无需通信,但它们需要更多的空间(128 位)。UUID 有几种不同的版本;最简单的是版本 4,它本质上是一个随机数,长度如此之长,以至于两个节点几乎不可能选择相同的值。不幸的是,这些 ID 的顺序也是随机的,因此比较两个 ID 并不能告诉你哪个更新。
Wall-clock timestamp made unique
使用时钟时间戳使其唯一
If your nodes’ time-of-day clock is kept approximately correct using NTP, you can generate IDs by putting a timestamp from that clock in the most significant bits, and filling the remaining bits with extra information that ensures the ID is unique even if the timestamp is not—for example, a shard number and a per-shard incrementing sequence number, or a long random value. This approach is used in Version 7 UUIDs [50], Twitter’s Snowflake [51], ULIDs [52], Hazelcast’s Flake ID generator, MongoDB ObjectIDs, and many similar schemes [50]. You can implement these ID generators in application code or within a database [53].
如果你的节点的时间时钟通过 NTP 保持大致正确,你可以通过将时钟的时间戳放入最高有效位,并用额外的信息填充剩余的位来生成 ID,这些信息确保即使时间戳不正确,ID 也是唯一的 —— 例如,一个分片编号和每个分片的递增序列号,或一个长随机值。这种方法在版本 7 UUID [50]、Twitter 的 Snowflake [51]、ULIDs [52]、Hazelcast 的 Flake ID 生成器、MongoDB ObjectIDs 以及许多类似的方案 [50] 中使用。你可以在应用代码中实现这些 ID 生成器,或是在数据库中实现 [53]。
All these schemes generate IDs that are unique (at least with high enough probability that collisions are vanishingly rare), but they have much weaker ordering guarantees for IDs than the single-node auto-incrementing scheme.
所有这些方案生成的 ID 都是唯一的(至少具有足够高的概率,使得冲突极为罕见),但它们对 ID 的排序保证比单节点自增方案弱得多。
As discussed in “Timestamps for ordering events”, wall-clock timestamps can provide at best an approximate ordering: if an earlier write gets a timestamp from a slightly fast clock, and a later write’s timestamp is from a slightly slow clock, the timestamp order may be inconsistent with the order in which the events actually happened. With clock jumps due to using a non-monotonic clock, even the timestamps generated by a single node might be ordered incorrectly. ID generators based on wall-clock time are therefore unlikely to be linearizable.
如 “用于排序事件的时戳” 中所述,墙上时钟时戳最多只能提供近似排序:如果较早的写入操作从略微快的时钟获取时戳,而较晚的写入操作的时戳来自略微慢的时钟,时戳顺序可能与事件实际发生的顺序不一致。由于使用非单调时钟导致的时钟跳变,即使单个节点生成的时戳也可能顺序错误。因此,基于墙上时钟的 ID 生成器不太可能实现线性化。
You can reduce such ordering inconsistencies by relying on high-precision clock synchronization, using atomic clocks or GPS receivers. But it would also be nice to be able to generate IDs that are unique and correctly ordered without relying on special hardware. That’s what logical clocks are about.
你可以通过依赖高精度时钟同步来减少这种排序不一致,使用原子钟或 GPS 接收器。但也很想能够生成唯一且正确排序的 ID,而不依赖特殊硬件。这就是逻辑时钟的作用。
Linearizable ID Generators 线性化 ID 生成器#
Although Lamport clocks and hybrid logical clocks provide useful ordering guarantees, that ordering is still weaker than the linearizable single-node ID generator we talked about previously. Recall that linearizability requires that if request A completed before request B began, then B must have the higher ID, even if A and B never communicated with each other. On the other hand, Lamport clocks can only ensure that a node generates timestamps that are greater than any other timestamp that node has seen, but it can’t say anything about timestamps that it hasn’t seen.
尽管 Lamport 时钟和混合逻辑时钟提供了有用的排序保证,但这种排序仍然比我们之前讨论的线性化单节点 ID 生成器弱。回想一下,线性化要求如果请求 A 在请求 B 开始之前完成,那么 B 必须具有更高的 ID,即使 A 和 B 从未相互通信。另一方面,Lamport 时钟只能确保一个节点生成的时戳大于该节点所见过的任何其他时戳,但它无法对它未见过时戳的情况做出任何说明。
Figure 10-10 shows how a non-linearizable ID generator could cause problems. Imagine a social media website where user A wants to share an embarrassing photo privately with their friends. A’s account is initially public, but using their laptop, A first changes their account settings to private. Then A uses their phone to upload the photo. Since A performed these updates in sequence, they might reasonably expect the photo upload to be subject to the new, restricted account permissions.
图 10-10 展示了非线性化 ID 生成器可能引发的问题。想象一个社交媒体网站,用户 A 想要私下与朋友分享一张尴尬的照片。A 的账户最初是公开的,但使用笔记本电脑,A 首先更改了账户设置改为私密。然后 A 使用手机上传照片。由于 A 按顺序执行了这些更新,他们可能会合理地期望照片上传受新的、受限的账户权限管理。

The account permission and the photo are stored in two separate databases (or separate shards of the same database), and let’s assume they use a Lamport clock or hybrid logical clock to assign a timestamp to every write. Since the photos database didn’t read from the accounts database, it’s possible that the local counter in the photos database is slightly behind, and therefore the photo upload is assigned a lower timestamp than the update of the account settings.
账户权限和照片存储在两个独立的数据库(或同一数据库的独立分片)中,并且假设它们使用 Lamport 时钟或混合逻辑时钟为每次写入分配时间戳。由于照片数据库没有从账户数据库读取,因此照片数据库中的本地计数器可能稍微落后,因此照片上传分配的时间戳可能低于账户设置更新的时间戳。
Next, let’s say that a viewer (who is not friends with A) is looking at A’s profile, and their read uses an MVCC implementation of snapshot isolation. It could happen that the viewer’s read has a timestamp that is greater than that of the photo upload, but less than that of the account settings update. As a result, the system will determine that the account is still public at the time of the read, and therefore show the viewer the embarrassing photo that they were not supposed to see.
接下来,假设一个查看者(A 的好友)正在查看 A 的个人资料,并且他们的读取使用了快照隔离的 MVCC 实现。可能会发生查看者的读取时间戳大于照片上传时间戳,但小于账户设置更新时间戳的情况。结果,系统将判定在读取时账户仍然是公开的,因此向查看者展示了他们本不该看到的那张尴尬照片。
You can imagine several possible ways of fixing this problem. Maybe the photos database should have read the user’s account status before performing the write, but it’s easy to forget such a check. If A’s actions had been performed on the same device, maybe the app on their device could have tracked the latest timestamp of that user’s writes—but if the user uses a laptop and a phone, as in the example, that’s not so easy.
你可以想象几种可能的解决这个问题的方法。也许照片数据库在执行写入操作前应该读取用户的账户状态,但很容易忘记这种检查。如果 A 的操作是在同一设备上执行的,也许他们设备上的应用程序可以跟踪该用户最新的写入时间戳 —— 但就像示例中那样,如果用户使用笔记本电脑和手机,那就没那么容易了。
The simplest solution in this case would be to use a linearizable ID generator, which would ensure that the photo upload is assigned a greater ID than the account permissions change.
在这种情况下,最简单的解决方案是使用线性化 ID 生成器,这将确保照片上传被分配一个比账户权限变更更大的 ID。
Implementing a linearizable ID generator 实现线性化 ID 生成器#
The simplest way of ensuring that ID assignment is linearizable is by actually using a single node for this purpose. That node only needs to atomically increment a counter and return its value when requested, persist the counter value (so that it doesn’t generate duplicate IDs if the node crashes and restarts), and replicate it for fault tolerance (using single-leader replication). This approach is used in practice: for example, TiDB/TiKV calls it a timestamp oracle, inspired by Google’s Percolator [57].
确保 ID 分配是线性化的最简单方法实际上是通过使用单个节点来实现。该节点只需要原子性地递增一个计数器并在被请求时返回其值,持久化计数器值(以便在节点崩溃并重启时不会生成重复的 ID),并通过单领导复制进行复制以实现容错。这种方法在实践中被使用:例如,TiDB / TiKV 将其称为时间戳预言机,灵感来自 Google 的 Percolator [57]。
As an optimization, you can avoid performing a disk write and replication on every single request. Instead, the ID generator can write a record describing a batch of IDs; once that record is persisted and replicated, the node can start handing out those IDs to clients in sequence. Before it runs out of IDs in that batch, it can persist and replicate the record for the next batch. That way, some IDs will be skipped if the node crashes and restarts or if you fail over to a follower, but you won’t issue any duplicate or out-of-order IDs.
作为优化,你可以避免在每次请求中都执行磁盘写入和复制。相反,ID 生成器可以写入一条记录来描述一批 ID;一旦该记录被持久化并复制,节点就可以按顺序将这些 ID 分发给客户端。在该批次 ID 用完之前,它可以持久化并复制下一批次的记录。这样,如果节点崩溃重启或你切换到从节点,某些 ID 可能会被跳过,但你不会发出任何重复或顺序错误的 ID。
You can’t easily shard the ID generator, since if you have multiple shards independently handing out IDs, you can no longer guarantee that their order is linearizable. You also can’t easily distribute the ID generator across multiple regions; thus, in a geographically distributed database, all requests for IDs will have to go to a node in a single region. On the upside, the ID generator’s job is very simple, so a single node can handle a large request throughput.
你无法轻易地将 ID 生成器分片,因为如果多个分片独立地分发 ID,你将无法再保证它们的顺序是可线性化的。你也无法轻易地将 ID 生成器分布在多个区域;因此,在地理分布式数据库中,所有对 ID 的请求都将必须发送到单个区域中的一个节点。不过,ID 生成器的任务非常简单,所以单个节点可以处理大量的请求吞吐量。
If you don’t want to use a single-node ID generator, an alternative is possible: you can do what Google’s Spanner does, as discussed in “Synchronized clocks for global snapshots”. It relies on a physical clock that returns not just a single timestamp, but a range of timestamps indicating the uncertainty in the clock reading. It then waits for the duration of that uncertainty interval to elapse before returning.
如果你不想使用单节点 ID 生成器,有一个替代方案:你可以像 Google 的 Spanner 那样做,正如在 “为全局快照同步时钟” 中讨论的那样。它依赖于一个物理时钟,这个时钟不仅返回一个时间戳,还返回一个时间戳范围,表示时钟读数的不可确定性。然后它等待这个不可确定性间隔过去后再返回。
Assuming that the uncertainty interval is correct (i.e., that the true current physical time always lies within that interval), this process also ensures that if one request completes before another begins, the later request will have a greater timestamp. This approach ensures this linearizable ID assignment without any communication: even requests in different regions will be ordered correctly, without waiting for cross-region requests. The downside is that you need hardware and software support for clocks to be tightly synchronized and compute the necessary uncertainty interval.
假设不可确定性间隔是正确的(即,真实的当前物理时间始终位于该间隔内),这个过程也确保了如果有一个请求在另一个请求开始之前完成,那么后来的请求将会有一个更大的时间戳。这种方法确保了这种线性化 ID 分配,而无需任何通信:即使请求位于不同区域,也会被正确排序,而无需等待跨区域请求。缺点是你需要硬件和软件支持,以使时钟紧密同步并计算必要的不可确定性间隔。
Enforcing constraints using logical clocks 使用逻辑时钟强制约束#
In “Constraints and uniqueness guarantees” we saw that a linearizable compare-and-set operation can be used to implement locks, uniqueness constraints, and similar constructs in a distributed system. This raises the question: is a logical clock or a linearizable ID generator also sufficient to implement these things?
在 “约束和唯一性保证” 中,我们看到线性化比较并设置操作可以用于在分布式系统中实现锁、唯一性约束和类似结构。这引发了问题:逻辑时钟或线性化 ID 生成器是否也足够实现这些功能?
The answer is: not quite. When you have several nodes that are all trying to acquire the same lock or register the same username, you could use a logical clock to assign timestamps to those requests, and pick the one with the lowest timestamp as the winner. If the clock is linearizable, you know that any future requests will always generate greater timestamps, and therefore you can be sure that no future request will receive an even lower timestamp than the winner.
答案是:并不完全如此。当有多个节点都在尝试获取同一个锁或注册同一个用户名时,你可以使用逻辑时钟为这些请求分配时间戳,并选择时间戳最低的那个作为获胜者。如果时钟是线性化的,你知道任何未来的请求都将始终生成更大的时间戳,因此你可以确信没有任何未来的请求会获得比获胜者更低的时间戳。
Unfortunately, part of the problem is still unsolved: how does a node know whether its own timestamp is the lowest? To be sure, it needs to hear from every other node that might have generated a timestamp [54]. If one of the other nodes has failed in the meantime, or cannot be reached due to a network problem, this system would grind to a halt, because we can’t be sure whether that node might have the lowest timestamp. This is not the kind of fault-tolerant system that we need.
不幸的是,这个问题仍然没有完全解决:节点如何知道自己的时间戳是否最低?要确保这一点,它需要从所有可能生成时间戳的其他节点那里收到消息 [54]。如果其他节点在 meantime 发生故障,或者由于网络问题无法到达,这个系统就会陷入停滞,因为我们无法确定那个节点是否可能拥有最低的时间戳。这并不是我们需要的那种容错系统。
To implement locks, leases, and similar constructs in a fault-tolerant way, we need something stronger than logical clocks or ID generators: we need consensus.
为了以容错的方式实现锁、租约和类似的构造,我们需要比逻辑时钟或 ID 生成器更强的东西:我们需要共识。
Consensus 一致性#
In this chapter we have seen several examples of things that are easy when you have only a single node, but which get a lot harder if you want fault tolerance:
在本章中,我们看到几个在只有一个节点时很容易的事情,但在需要容错时却变得非常困难:
- A database can be linearizable if you have only a single leader, and you make all reads and writes on that leader. But how do you fail over if that leader fails, while avoiding split brain? How do you ensure that a node that believes itself to be the leader hasn’t actually been voted out in the meantime?
如果只有一个领导者,并且所有读和写都在该领导者上执行,那么数据库可以是线性化的。但如果该领导者失败,如何进行故障转移同时避免脑裂?如何确保一个认为自己是领导者的节点实际上并没有在 meantime 被投票出去? - A linearizable ID generator on a single node is just a counter with an atomic fetch-and-add instruction, but what if it crashes?
单个节点上的线性化 ID 生成器只是一个带有原子获取并加指令的计数器,但如果它崩溃了呢? - An atomic compare-and-set (CAS) operation is useful for many things, such as deciding who gets a lock or lease when several processes are racing to acquire it, or ensuring the uniqueness of a file or user with a given name. On a single node, CAS may be as simple as one CPU instruction, but how do you make it fault-tolerant?
一个原子比较并设置(CAS)操作有很多用途,例如在多个进程竞争获取锁或租约时决定谁获得它,或确保具有给定名称的文件或用户的唯一性。在单个节点上,CAS 可能只是一个 CPU 指令那么简单,但如何让它具有容错性?
It turns out that all of these are instances of the same fundamental distributed systems problem:consensus. Consensus is one of the most important and fundamental problems in distributed computing; it is also infamously difficult to get right [58,59], and many systems have got it wrong in the past. Now that we have discussed replication (Chapter 6), transactions (Chapter 8), system models (Chapter 9), and linearizability (this chapter), we are finally ready to tackle the consensus problem.
事实证明,所有这些都是同一个基本分布式系统问题 —— 一致性的实例。一致性是分布式计算中最重要和最基本的问题之一;它也以正确实现起来臭名昭著 [58, 59],并且过去许多系统都未能正确处理它。现在我们已经讨论了复制(第 6 章)、事务(第 8 章)、系统模型(第 9 章)和线性化(本章),我们终于准备好解决一致性问题了。
The best-known consensus algorithms are Viewstamped Replication [60,61], Paxos [58,62,63,64], Raft [23,65,66], and Zab [18,22,67]. There are quite a few similarities between these algorithms, but they are not the same [68,69]. These algorithms work in a non-Byzantine system model: that is, network communication may be arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
最著名的共识算法有 Viewstamped Replication [60, 61]、Paxos [58, 62, 63, 64]、Raft [23, 65, 66] 和 Zab [18, 22, 67]。这些算法之间有许多相似之处,但它们并不相同 [68, 69]。这些算法工作在非拜占庭系统模型中:也就是说,网络通信可能会任意延迟或丢失,节点可能会崩溃、重启和断开连接,但算法假设节点在其他方面正确遵循协议,并且不会表现出恶意行为。
There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t correctly follow the protocol (for example, by sending contradictory messages to other nodes). A common assumption is that fewer than one-third of the nodes are Byzantine-faulty [26,70]. Such Byzantine fault tolerant (BFT) consensus algorithms are used in blockchains [71]. However, as explained in “Byzantine Faults”, BFT algorithms are beyond the scope of this book.
还有一些可以容忍某些拜占庭节点的共识算法,即那些不正确遵循协议的节点(例如,通过向其他节点发送矛盾消息)。一个常见的假设是,少于三分之一的节点是拜占庭有故障的 [26, 70]。这种拜占庭容错(BFT)共识算法用于区块链 [71]。然而,正如 “拜占庭故障” 中解释的那样,BFT 算法超出了本书的范围。
The Many Faces of Consensus 共识的多种面貌#
Consensus can be expressed in several different ways:
共识可以以多种不同的方式表达:
- Single-value consensus is very similar to an atomic compare-and-set operation, and it can be used to implement locks, leases, and uniqueness constraints.
单值共识非常类似于原子比较并设置操作,它可以用来实现锁、租约和唯一性约束。 - Constructing an append-only log also requires consensus; it is usually formalized as total order broadcast. With a log you can build state machine replication, leader-based replication, event sourcing, and other useful things.
构建只追加日志也需要共识;它通常被形式化为全序广播。有了日志,你可以构建状态机复制、基于领导的复制、事件溯源以及其他有用的事物。 - Atomic commitment of a multi-database or multi-shard transaction requires that all participants agree on whether to commit or abort the transaction.
多数据库或多分片的交易原子提交要求所有参与者就提交或中止交易达成一致。
We will explore all of these shortly. In fact, these problems are all equivalent to each other: if you have an algorithm that solves one of these problems, you can convert it into a solution for any of the others. This is quite a profound and perhaps surprising insight! And that’s why we can lump all of these things together under “consensus”, even though they look quite different on the surface.
我们将很快探讨所有这些问题。事实上,这些问题彼此都等价:如果你有一个解决其中任何一个问题的算法,你都可以将其转换为解决其他任何问题的方案。这是一个非常深刻,或许令人惊讶的见解!这就是为什么我们能把所有这些事情统称为 “共识”,尽管它们表面上看起来差异很大。
Single-value consensus 单值共识#
The standard formulation of consensus involves getting multiple nodes to agree on a single value. For example:
共识的标准表述涉及让多个节点就一个单一值达成一致。例如:
- When a database with single-leader replication first starts up, or when the existing leader fails, several nodes may concurrently try to become the leader. Similarly, multiple nodes may race to acquire a lock or lease. Consensus allows them to decide which one wins.
当具有单领导复制的数据库首次启动,或现有领导节点失败时,多个节点可能同时尝试成为领导节点。类似地,多个节点可能争相获取锁或租约。共识允许它们决定哪个胜出。 - If several people concurrently try to book the last seat on an airplane, or the same seat in a theater, or try to register an account with the same username, then a consensus algorithm could determine which one should succeed.
如果几个人同时尝试预订飞机的最后一张座位,或剧院的同一张座位,或尝试用相同的用户名注册账户,那么一致性算法可以确定哪一个应该成功。
More generally, one or more nodes may propose values, and the consensus algorithm decides on one of those values. In the examples above, each node could propose its own ID, and the algorithm decides which node ID should become the new leader, the holder of the lease, or the buyer of the airplane/theater seat. In this formalism, a consensus algorithm must satisfy the following properties [26]:
更一般地,一个或多个节点可以提出值,而一致性算法决定其中一个值。在上述例子中,每个节点可以提出自己的 ID,算法决定哪个节点 ID 应该成为新的领导者、租约持有者,或飞机 / 剧院座位的购买者。在这个形式化中,一致性算法必须满足以下属性 [26]:
Uniform agreement 统一协议
No two nodes decide differently.
没有两个节点做出不同的决定。
Integrity 完整性
Once a node has decided one value, it cannot change its mind by deciding another value.
一旦一个节点决定了一个值,它就不能通过决定另一个值来改变主意。
Validity 有效性
If a node decides value v, then v was proposed by some node.
如果一个节点决定值 v,那么 v 是由某个节点提出的。
Termination 终止
Every node that does not crash eventually decides some value.
每个不会崩溃的节点最终都会决定某个值。
If you want to decide multiple values, you can run a separate instance of the consensus algorithm for each. For example, you could have a separate consensus run for each bookable seat in the theater, so that you get one decision (one buyer) for each seat.
如果你想决定多个值,可以为每个值运行一个独立的共识算法实例。例如,可以为每个可预订的座位运行一个独立的共识实例,这样每个座位都会得到一个决定(一个买家)。
The uniform agreement and integrity properties define the core idea of consensus: everyone decides on the same outcome, and once you have decided, you cannot change your mind. The validity property rules out trivial solutions: for example, you could have an algorithm that always decides null, no matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but not the validity property.
一致性协议和完整性属性定义了共识的核心思想:每个人都对相同的结果做出决定,一旦做出决定,就不能改变主意。有效性属性排除了琐碎的解决方案:例如,你可以有一个总是决定 null 的算法,无论提出什么;这个算法会满足一致性和完整性属性,但不会满足有效性属性。
If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can just hardcode one node to be the “dictator,” and let that node make all of the decisions. However, if that one node fails, then the system can no longer make any decisions—just like single-leader replication without failover. All the difficulty arises from the need for fault tolerance.
如果你不关心容错性,那么满足前三个属性是很容易的:你可以直接将一个节点硬编码为 “独裁者”,让该节点做出所有决策。然而,如果这个节点失败了,系统就不再能做出任何决策 —— 就像没有故障转移的单领导者复制一样。所有的困难都源于对容错性的需求。
The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit around and do nothing forever—in other words, it must make progress. Even if some nodes fail, the other nodes must still reach a decision. (Termination is a liveness property, whereas the other three are safety properties—see “Safety and liveness”.)
终止属性形式化了容错性的概念。它本质上说的是,一个共识算法不能简单地无所事事地永远坐等 —— 换句话说,它必须取得进展。即使有些节点失败了,其他节点仍然必须达成一个决策。(终止是一个活性属性,而其他三个是安全性属性 —— 参见 “安全性与活性”。)
If a crashed node may recover, you could just wait for it to come back. However, consensus must ensure that it makes a decision even if a crashed node suddenly disappears and never comes back. (Instead of a software crash, imagine that there is an earthquake, and the datacenter containing your node is destroyed by a landslide. You must assume that your node is buried under 30 feet of mud and is never going to come back online.)
如果一个崩溃的节点可能恢复,你只需等待它回来。然而,共识必须确保即使一个崩溃的节点突然消失且不再回来,它也能做出决策。(而不是软件崩溃,想象一下发生地震,包含你的节点的数据中心被山体滑坡摧毁。你必须假设你的节点被 30 英尺厚的泥浆掩埋,且永远无法上线。)
Of course, if all nodes crash and none of them are running, then it is not possible for any algorithm to decide anything. There is a limit to the number of failures that an algorithm can tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination [73]. That majority can safely form a quorum (see “Quorums for reading and writing”).
当然,如果所有节点都崩溃且没有任何节点在运行,那么任何算法都无法做出任何决策。任何算法都能容忍的故障数量是有限的:事实上,可以证明任何共识算法至少需要多数节点正常运行才能保证终止 [73]。这个多数可以安全地形成一个仲裁组(参见 “读写仲裁组的设置”)。
Thus, the termination property is subject to the assumption that fewer than half of the nodes are crashed or unreachable. However, most consensus algorithms ensure that the safety properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or there is a severe network problem [75]. Thus, a large-scale outage can stop the system from being able to process requests, but it cannot corrupt the consensus system by causing it to make inconsistent decisions.
因此,终止属性取决于假设少于一半的节点崩溃或无法访问。然而,大多数共识算法确保安全属性 —— 一致性、完整性和有效性 —— 始终得到满足,即使大多数节点失效或存在严重的网络问题 [75]。因此,大规模停机可能会阻止系统处理请求,但它不能通过导致系统做出不一致的决策来破坏共识系统。
Compare-and-set as consensus 比较并设置作为共识#
A compare-and-set (CAS) operation checks whether the current value of some object equals some expected value; if yes, it atomically updates the object to some new value; if no, it leaves the object unchanged and returns an error.
比较并设置(CAS)操作检查某个对象的当前值是否等于某个期望值;如果相等,它原子性地将对象更新为新值;如果不相等,它保持对象不变并返回错误。
If you have a fault-tolerant, linearizable CAS operation, it is easy to solve the consensus problem: initially set the object to a null value; each node that wants to propose a value invokes CAS with the expected value being null, and the new value being the value it wants to propose (assuming it is non-null). The decided value is then whatever value the object is set to.
如果你有一个容错的线性化 CAS 操作,解决一致性问题是很容易的:初始时将对象设置为空值;每个想要提议一个值的节点调用 CAS,期望值为空,新值为它想要提议的值(假设它非空)。最终决定值就是对象被设置的那个值。
Likewise, if you have a solution for consensus, you can implement CAS: whenever one or more nodes want to perform CAS with the same expected value, you use the consensus protocol to propose the new values in the CAS invocation, and then set the object to whatever value was decided by the consensus. Any CAS invocations whose new value was not decided return an error. CAS invocations with different expected values use separate runs of the consensus protocol.
同样地,如果你有一个解决一致性问题的方案,你可以实现 CAS:每当一个或多个节点想要以相同的期望值执行 CAS 操作时,你使用一致性协议来提议 CAS 调用中的新值,然后将对象设置为一致性协议决定的那个值。任何新值未被决定 CAS 调用都返回错误。期望值不同的 CAS 调用使用一致性协议的不同运行。
This shows that CAS and consensus are equivalent to each other [28,73]. Again, both are straightforward on a single node, but challenging to make fault-tolerant. As an example of CAS in a distributed setting, we saw conditional write operations for object stores in “Databases backed by object storage”, which allow a write to happen only if an object with the same name has not been created or modified by another client since the current client last read it.
这表明 CAS(比较并交换)和一致性协议是等价的 [28, 73]。同样,它们在单个节点上都很简单,但在实现容错性时具有挑战性。作为 CAS 在分布式环境中的一个例子,我们在 “基于对象存储的数据库” 中看到了条件写操作,这些操作允许只有在当前客户端上次读取后,另一个客户端没有创建或修改具有相同名称的对象时,才允许写入。
However, a linearizable read-write register is not sufficient to solve consensus. The FLP result tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop model [72], but we saw in “Linearizability and quorums” that a linearizable register can be implemented using quorum reads/writes in this model [24,25, 26]. From this it follows that a linearizable register cannot solve consensus.
然而,可线性化读写寄存器不足以解决一致性协议问题。FLP 定理告诉我们,在异步崩溃停止模型中,一致性协议不能通过确定性算法解决 [72],但在 “可线性化性和仲裁” 中我们看到,在这个模型中可以使用仲裁读 / 写来实现可线性化寄存器 [24, 25, 26]。由此可知,可线性化寄存器不能解决一致性协议问题。
Shared logs as consensus 共享日志作为一致性协议#
We have seen several examples of logs, such as replication logs, transaction logs, and write-ahead logs. A log stores a sequence of log entries, and anyone who reads it sees the same entries in the same order. Sometimes a log has a single writer that is allowed to append new entries, but a shared log is one where multiple nodes can request entries to be appended. An example is single-leader replication: any client can ask the leader to make a write, which the leader appends to the replication log, and then all followers apply the writes in the same order as the leader.
我们看到了多个日志的例子,例如复制日志、事务日志和预写日志。一个日志存储一系列日志条目,任何读取它的人都会看到相同的条目按相同顺序排列。有时一个日志只有一个允许追加新条目的写入者,但共享日志是多个节点可以请求追加条目的日志。一个例子是单领导者复制:任何客户端都可以请求领导者执行写入,领导者将写入追加到复制日志中,然后所有跟随者按领导者的相同顺序应用写入。
More formally, a shared log supports two operations: you can request for a value to be added to the log, and you can read the entries in the log. It must satisfy the following properties:
更正式地说,共享日志支持两种操作:你可以请求将一个值添加到日志中,你可以读取日志中的条目。它必须满足以下属性:
Eventual append 最终追加
If a node requests for some value to be added the log, and the node does not crash, then that node must eventually read that value in a log entry.
如果一个节点请求将某个值添加到日志中,并且该节点没有崩溃,那么该节点最终必须在某个日志条目中读取该值。
Reliable delivery 可靠传输
No log entries are lost: if one node reads some log entry, then eventually every node that does not crash must also read that log entry.
不会丢失日志条目:如果一个节点读取了某个日志条目,那么最终所有未崩溃的节点也必须读取该日志条目。
Append-only 仅追加
Once a node has read some log entry, it is immutable, and new log entries can only be added after it, but not before. A node may re-read the log, in which case it sees the same log entries in the same order as it read them initially (even if the node crashes and restarts).
一旦一个节点读取了某个日志条目,它就是不可变的,新的日志条目只能追加在该条目之后,而不能在其之前。一个节点可以重新读取日志,此时它会以与最初读取时相同的顺序看到相同的日志条目(即使节点崩溃并重启)。
Agreement 协议
If two nodes both read some log entry e, then prior to e they must have read exactly the same sequence of log entries in the same order.
如果两个节点都读取了某个日志条目 e,那么在 e 之前,它们必须以相同的顺序读取了完全相同的日志条目序列。
Validity 有效性
If a node reads a log entry containing some value, then some node previously requested for that value to be added to the log.
如果一个节点读取了一个包含某个值的日志条目,那么之前必定有某个节点请求将该值添加到日志中。
Note 注意#
A shared log is formally known as a total order broadcast, atomic broadcast, or total order multicast protocol [26,76,77]. It’s the same thing described in different words: requesting a value to be added to the log is then called “broadcasting” it, and reading a log entry is called “delivering” it.
一个共享日志在形式上被称为全序广播、原子广播或全序多播协议 [26, 76, 77]。这是用不同词语描述的同一件事:请求将值添加到日志中被称为 “广播” 它,而读取日志条目被称为 “交付” 它。
If you have an implementation of a shared log, it is easy to solve the consensus problem: every node that wants to propose a value requests for it to be added to the log, and whichever value is read back in the first log entry is the value that is decided. Since all nodes read log entries in the same order, they are guaranteed to agree on which value is delivered first [28].
如果你有一个共享日志的实现,很容易解决一致性问题:每个想要提议一个值的节点都请求将其添加到日志中,而第一个日志条目中读取到的值就是被决定的值。由于所有节点都以相同的顺序读取日志条目,因此它们保证会就哪个值首先被交付达成一致 [28]。
Conversely, if you have a solution for consensus, you can implement a shared log. The details are a bit more complicated, but the basic idea is this [73]:
相反,如果你有一个解决一致性问题的方案,你可以实现一个共享日志。细节稍微复杂一些,但基本思想是这样的 [73]:
- You have a slot in the log for every future log entry, and you run a separate instance of the consensus algorithm for every such slot to decide what value should go in that entry.
日志中为每个未来的日志条目都有一个槽位,并且为每个这样的槽位运行一个独立的共识算法实例来决定应该将该条目中放入什么值。 - When a node wants to add a value to the log, it proposes that value for one of the slots that has not yet been decided.
当一个节点想要向日志中添加一个值时,它会为尚未决定的一个槽位提议这个值。 - When the consensus algorithm decides for one of the slots, and all the previous slots have already been decided, then the decided value is appended as a new log entry, and any consecutive slots that have been decided also have their decided value appended to the log.
当共识算法为其中一个槽位做出决定,并且所有之前的槽位都已经被决定时,那么被决定的值会被追加为一个新的日志条目,并且所有已决定的连续槽位的值也会被追加到日志中。 - If a proposed value was not chosen for some slot, the node that wanted to add it retries by proposing it for a later slot.
如果一个提议的值没有被选择用于某个槽位,那么想要添加它的节点会通过为后面的槽位再次提议它来重试。
This shows that consensus is equivalent to total order broadcast and shared logs. Single-leader replication without failover does not meet the liveness requirements, since it stops delivering messages if the leader crashes. As usual, the challenge is in performing failover safely and automatically.
这表明共识等同于全序广播和共享日志。无故障切换的单领导者复制不满足活性要求,因为它在领导者崩溃时会停止传递消息。像往常一样,挑战在于如何安全且自动地执行故障切换。
Fetch-and-add as consensus 作为共识的 fetch-and-add#
The linearizable ID generator we saw in “Linearizable ID Generators” comes close to solving consensus, but it falls slightly short. We can implement such an ID generator using a fetch-and-add operation, which atomically increments a counter and returns the old counter value.
我们在 “线性化 ID 生成器” 中看到的线性化 ID 生成器接近于解决共识问题,但它略有不逮。我们可以使用 fetch-and-add 操作来实现这样的 ID 生成器,该操作原子性地递增一个计数器并返回旧计数器值。
If you have a CAS operation, it’s easy to implement fetch-and-add: first read the counter value, then perform a CAS where the expected value is the value you read, and the new value is that value plus one. If the CAS fails, you retry the whole process until the CAS succeeds. This is less efficient than a native fetch-and-add operation when there is contention, but it is functionally equivalent. Since you can implement CAS using consensus, you can also implement fetch-and-add using consensus.
如果你有 CAS 操作,实现 fetch-and-add 很容易:首先读取计数器的值,然后执行 CAS 操作,其中期望值是你读取的值,新值是那个值加一。如果 CAS 失败,你重试整个过程直到 CAS 成功。在有竞争的情况下,这不如原生的 fetch-and-add 操作高效,但功能上等效。由于你可以使用共识来实现 CAS,你也可以使用共识来实现 fetch-and-add。
Conversely, if you have a fault-tolerant fetch-and-add operation, can you solve the consensus problem? Let’s say you initialize the counter to zero, and every node that wants to propose a value invokes the fetch-and-add operation to increment the counter. Since the fetch-and-add operation is atomic, one of the nodes will read the initial value of zero, and the others will all read a value that has been incremented at least once.
相反,如果你有容错的 fetch-and-add 操作,你能解决共识问题吗?假设你将计数器初始化为零,并且每个想要提议值的节点都调用 fetch-and-add 操作来递增计数器。由于 fetch-and-add 操作是原子的,其中一个节点会读取初始值零,而其他节点都会读取至少被递增过一次的值。
Now let’s say that the node that reads zero is the winner, and its value is decided. That works for the node that read zero, but the other nodes have a problem: they know that they are not the winner, but they don’t know which of the other nodes has won. The winner could send a message to the other nodes to let them know it has won, but what if the winner crashes before it has a chance to send this message? In that case the other nodes are left hanging, unable to decide any value, and thus the consensus does not terminate. And the other nodes can’t fall back to another node, because the node that read zero may yet come back and rightly decide the value it proposed.
现在假设读取零的节点是获胜者,其值也就确定了。这对读取零的节点来说可行,但对其他节点来说是个问题:它们知道自己不是获胜者,但不知道其他哪个节点获胜了。获胜者可以向其他节点发送消息告知其获胜,但如果获胜者在有机会发送这条消息之前就崩溃了怎么办?在这种情况下,其他节点会悬而未决,无法决定任何值,从而导致共识无法终止。而且其他节点无法退回到另一个节点,因为读取零的节点可能会回来并正确地决定它所提议的值。
An exception is if we know for sure that no more than two nodes will propose a value. In that case, the nodes can send each other the values they want to propose, and then each perform the fetch-and-add operation. The node that reads zero decides its own value, and the node that reads one decides the other node’s value. This solves the consensus problem among two nodes, which is why we can say that fetch-and-add has a consensus number of two [28]. In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so they have a consensus number of ∞ (infinity).
例外情况是我们确切知道最多只有两个节点会提议一个值。在这种情况下,节点可以相互发送它们想要提议的值,然后每个节点执行 fetch-and-add 操作。读取到零的节点决定自己的值,读取到一的节点决定另一个节点的值。这解决了两个节点之间的共识问题,这也是为什么我们可以说 fetch-and-add 的共识数为二 [28]。相比之下,CAS 和共享日志解决了可能提议值的任意数量节点的共识问题,因此它们的共识数为无穷大(∞)。
Atomic commitment as consensus 原子提交作为共识#
In “Distributed Transactions” we saw the atomic commitment problem, which is to ensure that the databases or shards involved in a distributed transaction all either commit or abort a transaction. We also saw the two-phase commit algorithm, which relies on a coordinator that is a single point of failure.
在 “分布式事务” 中,我们了解了原子提交问题,即确保参与分布式事务的数据库或分片要么全部提交事务,要么全部中止事务。我们还了解了两阶段提交算法,该算法依赖于一个单一故障点的事务协调器。
What is the relationship between consensus and atomic commitment? At first glance, they seem very similar—both require nodes to come to some form of agreement. However, there is one important difference: with consensus it’s okay to decide any value that proposed, whereas with atomic commitment the algorithm must abort if any of the participants voted to abort. More precisely, atomic commitment requires the following properties [78]:
共识和原子提交之间的关系是什么?乍一看,它们似乎非常相似 —— 两者都要求节点达成某种形式的协议。然而,有一个重要的区别:在共识中,决定任何被提议的值都是可以接受的,而在原子提交中,如果任何参与者投票中止,算法必须中止。更精确地说,原子提交要求以下属性 [78]:
Uniform agreement 统一协议
No two nodes decide on different outcomes.
没有两个节点会决定不同的结果。
Integrity 完整性
Once a node has decided one outcome, it cannot change its mind by deciding another outcome.
一旦一个节点决定了一个结果,它不能通过决定另一个结果来改变主意。
Validity 有效性
If a node decides to commit, then all nodes must have previously voted to commit. If any node voted to abort, the nodes must abort.
如果一个节点决定提交,那么所有节点都必须之前投了赞成票。如果有任何节点投了反对票,所有节点都必须中止。
Non-triviality 非平凡性
If all nodes vote to commit, and no communication timeouts occur, then all nodes must decide to commit.
如果所有节点都投票提交,并且没有发生通信超时,那么所有节点都必须决定提交。
Termination 终止性
Every node that does not crash eventually decides to either commit or abort.
每个没有崩溃的节点最终都会决定提交或中止。
The validity property ensures that a transaction can only commit if all nodes agree; and the non-triviality property ensures the algorithm can’t simply always abort (but it allows an abort if any of the communication among the nodes times out). The other three properties are basically the same as for consensus.
有效性属性确保一个事务只有在所有节点都同意的情况下才能提交;非平凡性属性确保算法不能总是中止(但它允许在节点间通信超时时中止)。其他三个属性基本上与一致性相同。
If you have a solution for consensus, there are multiple ways you could solve atomic commitment [78,79]. One works like this: when you want to commit the transaction, every node sends its vote to commit or abort to every other node. Nodes that receive a vote to commit from itself and every other node propose “commit” using the consensus algorithm; nodes that receive a vote to abort, or which experience a timeout, propose “abort” using the consensus algorithm. When a node finds out what the consensus algorithm decided, it commits or aborts accordingly.
如果你有一个解决一致性的方案,那么你可以用多种方法解决原子提交 [78, 79]。其中一种方法如下:当你想要提交事务时,每个节点向其他所有节点发送提交或中止的投票。收到来自自己和所有其他节点的提交投票的节点,使用一致性算法提议 “提交”;收到中止投票或遇到超时的节点,使用一致性算法提议 “中止”。当一个节点得知一致性算法的决定后,它会相应地提交或中止。
In this algorithm, “commit” will only be proposed if all nodes voted to commit. If any node voted to abort, all proposals in the consensus algorithm will be “abort”. It could happen that some nodes propose “abort” while others propose “commit” if all nodes voted to commit but some communication timed out; in this case it doesn’t matter whether the nodes commit or abort, as long as they all do the same.
在这个算法中,只有当所有节点都投票同意提交时,“提交” 才会被提议。如果任何节点投票决定中止,那么共识算法中的所有提议都将变为 “中止”。如果所有节点都投票同意提交,但有些通信超时,可能会出现一些节点提议 “中止” 而另一些节点提议 “提交” 的情况;在这种情况下,节点是提交还是中止并不重要,只要它们都做同样的操作即可。
If you have a fault-tolerant atomic commitment protocol, you can also solve consensus. Every node that wants to propose a value starts a transaction on a quorum of nodes, and at each node it performs a single-node CAS to set a register to the proposed value if its value has not already been set by another transaction. If the CAS succeeds, the node votes to commit, otherwise it votes to abort. If the atomic commit protocol decides to commit a transaction, its value is decided for consensus; if atomic commit aborts, the proposing node retries with a new transaction.
如果你有一个容错的原子提交协议,你也可以解决共识问题。想要提议一个值的每个节点在多个节点的多数派上启动一个事务,并在每个节点上执行单节点 CAS 操作,如果该节点的值尚未被另一个事务设置,则将其注册设置为提议的值。如果 CAS 操作成功,该节点投票决定提交,否则它投票决定中止。如果原子提交协议决定提交一个事务,那么该值就决定了共识;如果原子提交中止,提议节点将使用新的事务重试。
This shows that atomic commit and consensus are also equivalent to each other.
这表明原子提交和共识也是彼此等价的。
Consensus in Practice 实践中的共识#
We have seen that single-value consensus, CAS, shared logs, and atomic commitment are all equivalent to each other: you can convert a solution to one of them into a solution to any of the others. That is a valuable theoretical insight, but it doesn’t answer the question: which of these many formulations of consensus is the most useful in practice?
我们已经看到单值一致性、CAS、共享日志和原子提交都是等价的:你可以将一个解决方案转换为它们中的任何一个。这是一个有价值的理论见解,但它并没有回答这个问题:在这些众多的一致性表述中,哪一个在实践中最有用?
The answer is that most consensus systems provide shared logs, also known as total order broadcast. Raft, Viewstamped Replication, and Zab provide shared logs right out of the box. Paxos provides single-value consensus, but in practice most systems using Paxos actually use the extension called Multi-Paxos, which also provides a shared log.
答案是大多数一致性系统都提供共享日志,也称为全序广播。Raft、Viewstamped Replication 和 Zab 直接提供共享日志。Paxos 提供单值一致性,但在实践中,大多数使用 Paxos 的系统实际上使用称为 Multi-Paxos 的扩展,它也提供了共享日志。
Using shared logs 使用共享日志#
A shared log is a good fit for database replication: if every log entry represents a write to the database, and every replica processes the same writes in the same order using deterministic logic, then the replicas will all end up in a consistent state. This idea is known as state machine replication [80], and it is the principle behind event sourcing, which we saw in “Event Sourcing and CQRS”. Shared logs are also useful for stream processing, as we shall see in Chapter 12.
一个共享日志非常适合数据库复制:如果每个日志条目都代表对数据库的一次写入,并且每个副本都使用确定性逻辑以相同顺序处理相同的写入,那么所有副本最终都会达到一致状态。这个概念被称为状态机复制 [80],它是事件溯源背后的原理,我们在 “事件溯源与 CQRS” 中看到过。共享日志对于流处理也非常有用,正如我们在第 12 章中将要看到的。
Similarly, a shared log can be used to implement serializable transactions: as discussed in “Actual Serial Execution”, if every log entry represents a deterministic transaction to be executed as a stored procedure, and if every node executes those transactions in the same order, then the transactions will be serializable [81,82].
同样地,共享日志也可以用来实现可串行化的事务:正如在 “实际串行执行” 中讨论的,如果每个日志条目都表示一个确定性事务,并且每个节点都以相同的顺序执行这些事务,那么这些事务将是可串行化的 [81, 82]。
Note 注意#
Sharded databases with a strong consistency model often maintain a separate log per shard, which improves scalability, but limits the consistency guarantees (e.g., consistent snapshots, foreign key references) they can offer across shards. Serializable transactions across shards are possible, but require additional coordination [83].
具有强一致性模型的分片数据库通常为每个分片维护一个单独的日志,这提高了可扩展性,但限制了它们能在分片之间提供的容错保证(例如,一致性快照、外键引用)。跨分片的可串行化事务是可能的,但需要额外的协调 [83]。
A shared log is also powerful because it can easily be adapted to other forms of consensus:
共享日志也很强大,因为它可以很容易地适应其他形式的共识:
- We saw previously how to use it to implement single-value consensus and CAS: simply decide the value that appears first in the log.
我们之前看到了如何使用它来实现单值共识和 CAS:只需决定日志中首先出现的值。 - If you want many instances of single-value consensus (e.g. one per seat in a theater that several people are trying to book), include the seat number in the log entries, and decide the first log entry that contains a given seat number.
如果你需要多个单值一致性实例(例如,在剧院的每个座位上,有几个人正在尝试预订),请在日志条目中包含座位号,并决定包含给定座位号的第一个日志条目。 - If you want an atomic fetch-and-add, put the number to add to the counter in a log entry, and the current counter value is the sum of all of the log entries so far. A simple counter on log entries can be used to generate fencing tokens (see “Fencing off zombies and delayed requests”); for example, in ZooKeeper, this sequence number is called
zxid[18].
如果你需要一个原子获取并加操作,将加到计数器上的数字放在日志条目中,当前计数器的值是到目前为止所有日志条目的总和。可以在日志条目上使用一个简单的计数器来生成围栏令牌(参见 “隔离僵尸和延迟请求”);例如,在 ZooKeeper 中,这个序列号被称为zxid[ 18]。
From single-leader replication to consensus 从单领导者复制到一致性#
We saw previously that single-value consensus is easy if you have a single “dictator” node that makes the decision, and likewise a shared log is easy if a single leader is the only node that is allowed to append entries to it. The question is how to provide fault tolerance if that node fails.
我们之前看到,如果你有一个单一的 “独裁者” 节点来做出决策,单值一致性就很容易,同样,如果只有一个领导者节点被允许向它追加条目,共享日志也很容易。问题是如果该节点失败,如何提供容错能力。
Traditionally, databases with single-leader replication didn’t solve this problem: they left leader failover as an action that a human administrator had to perform manually. Unfortunately, this means a significant amount of downtime, since there is a limit to how fast humans can react, and it doesn’t satisfy the termination property of consensus. For consensus, we require that the algorithm can automatically choose a new leader. (Not all consensus algorithms have a leader, but the commonly used algorithms do [84].)
传统上,单领导者复制的数据库并没有解决这个问题:它们将领导者切换视为需要人工管理员手动执行的操作。不幸的是,这意味着会存在大量停机时间,因为人类反应速度有限,而且这也不满足共识的终止特性。对于共识而言,我们要求算法能够自动选择新的领导者。(并非所有共识算法都有领导者,但常用的算法是有的 [84]。)
However, there is a problem. We previously discussed the problem of split brain, and said that all nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to be the leader, and consequently make inconsistent decisions. Thus, it seems like we need consensus in order to elect a leader, and we need a leader in order to solve consensus. How do we break out of this conundrum?
然而,存在一个问题。我们之前讨论了分裂脑问题,并指出所有节点需要就领导者是谁达成一致 —— 否则两个不同的节点可能会各自认为自己就是领导者,并因此做出不一致的决策。因此,似乎我们需要共识来选举领导者,而我们需要领导者来解决共识问题。我们如何摆脱这个困境?
In fact, consensus algorithms don’t require that there is only one leader at any one time. Instead, they make a weaker guarantee: they define an epoch number (called the ballot number in Paxos,view number in Viewstamped Replication, and term number in Raft) and guarantee that within each epoch, the leader is unique.
事实上,共识算法并不要求在任何时候只有一个领导者。相反,它们提供了一个较弱的保证:它们定义了一个时期编号(在 Paxos 中称为投票编号,在 Viewstamped Replication 中称为视图编号,在 Raft 中称为任期编号),并保证在每个时期内,领导者是唯一的。
When a node believes that the current leader is dead because it hasn’t heard from the leader for some timeout, it may start a vote to elect a new leader. This election is given a new epoch number that is greater than any previous epoch. If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn’t dead after all), then the leader with the higher epoch number prevails.
当一个节点认为当前领导者已经死亡,因为它没有在超时时间内收到来自领导者的消息时,它可能会开始投票选举新的领导者。这次选举会被赋予一个新的时期编号,该编号大于任何之前的时期编号。如果两个不同时期的两个不同领导者之间存在冲突(可能是因为之前的领导者实际上并没有死亡),那么具有较高时期编号的领导者将胜出。
Before a leader is allowed to append the next entry to the shared log, it must first check that there isn’t some other leader with a higher epoch number which might append a different entry. It can do this by collecting votes from a quorum of nodes—typically, but not always, a majority of nodes [85]. A node votes yes only if it is not aware of any other leader with a higher epoch.
在允许领导者向共享日志追加下一个条目之前,它必须首先确认没有其他具有更高纪元数的领导者可能追加不同的条目。它可以通过收集多数节点(通常,但并非总是)的投票来实现这一点 [85]。只有当节点没有意识到任何具有更高纪元的其他领导者时,才会投票赞成。
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal for the next entry to append to the log. The quorums for those two votes must overlap: if a vote on a proposal succeeds, at least one of the nodes that voted for it must have also participated in the most recent successful leader election [85]. Thus, if the vote on a proposal passes without revealing any higher-numbered epoch, the current leader can conclude that no leader with a higher epoch number has been elected, and therefore it can safely append the proposed entry to the log [26,86].
因此,我们有两轮投票:一次用于选择领导者,另一次用于投票决定领导者对日志追加下一个条目的提议。这两轮投票的法定人数必须重叠:如果对提议的投票成功,至少有一个投票赞成该提议的节点也参与了最近一次成功的领导者选举 [85]。因此,如果对提议的投票通过而没有揭示任何更高编号的纪元,当前领导者可以得出结论,没有选举出具有更高纪元数的领导者,因此它可以安全地将提议的条目追加到日志中 [26, 86]。
These two rounds of voting look superficially similar to two-phase commit, but they are very different protocols. In consensus algorithms, any node can start an election and it requires only a quorum of nodes to respond; in 2PC, only the coordinator can request votes, and it requires a “yes” vote from every participant before it can commit.
这两轮投票在表面上看起来类似于两阶段提交,但它们是非常不同的协议。在共识算法中,任何节点都可以发起选举,并且只需要多数节点响应;而在两阶段提交中,只有协调者可以请求投票,并且它需要从每个参与者那里获得 “是” 的投票才能提交。
Subtleties of consensus 共识的微妙之处#
This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that the leader wants to append to the log [68,69]. Every new log entry is synchronously replicated to a quorum of nodes before it is confirmed to the client that requested the write. This ensures that the log entry won’t be lost if the current leader fails.
这种基本结构在 Raft、多 Paxos、Zab 和视图标记复制中都是通用的:多数节点的投票选出一个领导者,然后领导者想要追加到日志中的每个条目都需要另一个多数节点的投票 [68, 69]。每个新的日志条目在确认给请求写入的客户端之前,都会同步复制到多数节点。这确保了如果当前的领导者失败,日志条目不会丢失。
However, the devil is in the details, and that’s also where these algorithms take different approaches. For example, when the old leader fails and a new one is elected, the algorithm needs to ensure that the new leader honors any log entries that had already been appended by the old leader before it failed. Raft does this by only allowing a node to become the new leader if its log is at least as up-to-date as a majority of its followers [69]. In contrast, Paxos allows any node to become the new leader, but requires it to bring its log up-to-date with other nodes before it can start appending new entries of its own.
然而,细节才是关键,这也是这些算法采取不同方法的地方。例如,当旧领导者失效并选出新的领导者时,算法需要确保新领导者尊重旧领导者在其失效前已追加的任何日志条目。Raft 通过仅允许一个节点的日志至少与它的大多数追随者一样最新,来做到这一点 [69]。相比之下,Paxos 允许任何节点成为新的领导者,但要求它在开始追加自己的新条目之前,必须将其日志与其他节点保持最新。
Another subtlety is in how the algorithms deal with log entries that had been proposed by the old leader before it failed, but for which the vote on appending to the log had not yet completed. You can find discussions of these details in the references for this chapter [23,69,86].
另一个微妙之处在于算法如何处理旧领导者在其失败之前提议但尚未完成追加到日志的投票记录。你可以在本章的参考文献中找到关于这些细节的讨论 [23, 69, 86]。
For databases that use a consensus algorithm for replication, not only do writes need to be turned into log entries and replicated to a quorum. If you want to guarantee linearizable reads, they also have to go through a quorum vote similarly to a write, to confirm that the node that believes to be the leader really still is up-to-date. Linearizable reads in etcd work like this, for example.
对于使用共识算法进行复制的数据库,不仅写入需要转换为日志记录并复制到多数派。如果你要保证线性化读取,它们也需要像写入一样通过多数派投票来确认,以确认认为自己是领导者的节点是否仍然是最新的。例如,etcd 中的线性化读取就是这样工作的。
In their standard form, most consensus algorithms assume a fixed set of nodes—that is, nodes may go down and come back up again, but the set of nodes that is allowed to vote is fixed when the cluster is created. In practice, it’s often necessary to add new nodes or remove old nodes in a system configuration. Consensus algorithms have been extended with reconfiguration features that make this possible. This is especially useful when adding new regions to a system, or when migrating from one location to another (by first adding the new nodes, and then removing the old nodes).
在标准形式下,大多数共识算法假设一组固定的节点 —— 也就是说,节点可能会宕机并再次上线,但在集群创建时允许投票的节点集是固定的。在实践中,系统配置中经常需要添加新节点或移除旧节点。共识算法通过引入重新配置功能来扩展这一能力。当向系统添加新区域,或从一个地点迁移到另一个地点时(通过先添加新节点,然后移除旧节点),这一点尤其有用。
Pros and cons of consensus 共识的优缺点#
Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed systems. Consensus is essentially “single-leader replication done right”, with automatic failover on leader failure, ensuring that no committed data is lost and no split-brain is possible, even in the face of all the problems we discussed in Chapter 9.
尽管复杂且微妙,共识算法是分布式系统的一大突破。共识本质上是 “正确实现的单领导复制”,具有在领导节点故障时的自动故障转移功能,确保不会丢失已提交的数据,且即使在第 9 章中讨论的所有问题面前,也不可能发生脑裂。
Since single-leader replication with automatic failover is essentially one of the definitions of consensus, any system that provides automatic failover but does not use a proven consensus algorithm is likely to be unsafe [87]. Using a proven consensus algorithm is not a guarantee of correctness of the whole system—there are still plenty of other places where bugs can lurk—but it’s a good start.
由于单领导者复制与自动故障转移本质上是对一致性的定义之一,任何提供自动故障转移但未使用经过验证的一致性算法的系统很可能是不安全的 [87]。使用经过验证的一致性算法并不能保证整个系统的正确性 —— 仍然有大量其他地方可能隐藏着错误 —— 但它是一个好的开始。
Nevertheless, consensus is not used everywhere, because the benefits come at a cost. Consensus systems always require a strict majority to operate—three nodes to tolerate one failure, or five nodes to tolerate two failures. Every operation needs to communicate with a quorum, so you can’t increase throughput by adding more nodes (in fact, every node you add makes the algorithm slower). If a network partition cuts off some nodes from the rest, only the majority portion of the network can make progress, and the rest are blocked.
然而,共识机制并非无处不在,因为其优势伴随着代价。共识系统始终需要严格多数来运行 —— 三个节点来容忍一个故障,或五个节点来容忍两个故障。每项操作都需要与多数节点通信,因此你无法通过增加更多节点来提高吞吐量(实际上,每增加一个节点都会使算法变慢)。如果网络分区导致部分节点与其它节点断开连接,只有多数部分的网络可以继续运行,而其余部分则被阻塞。
Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, especially systems distributed across multiple geographic regions, it can be difficult to tune these timeouts: if they are too large it takes a long time to recover from a failure; if they are too small there can be lots of unnecessary leader elections, resulting in terrible performance as the system can end up spending more time choosing leaders than doing useful work.
共识系统通常依赖超时机制来检测失效节点。在网络延迟变化剧烈的环境中,尤其是在跨多个地理区域分布的系统中,调整这些超时时间可能很困难:如果超时时间过长,系统从故障中恢复的时间就会很长;如果超时时间过短,则会出现大量不必要的领导者选举,导致系统性能恶化,因为系统最终会花费更多时间选择领导者,而不是执行有用的工作。
Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft has been shown to have unpleasant edge cases [88,89]: if the entire network is working correctly except for one particular network link that is consistently unreliable, Raft can get into situations where leadership continually bounces between two nodes, or the current leader is continually forced to resign, so the system effectively never makes progress. Designing algorithms that are more robust to unreliable networks is still an open research problem.
有时,共识算法对网络问题特别敏感。例如,Raft 已被证明在某些边缘情况下表现不佳 [88, 89]:如果整个网络除了一个持续不可靠的特定网络链路外都正常工作,Raft 可能会陷入领导权在两个节点之间不断切换,或者当前领导者被迫不断辞职的情况,导致系统实际上无法取得进展。设计对不可靠网络更具鲁棒性的算法仍然是一个开放的研究问题。
For systems that want to be highly available, but don’t want to accept the cost of consensus, the only real alternative is to use a weaker consistency model instead, such as those offered by leaderless or multi-leader replication as discussed in Chapter 6. These approaches generally don’t offer linearizability, but for applications that don’t need it that is fine.
对于那些希望实现高可用性但不想接受共识代价的系统来说,唯一真正的替代方案是使用较弱的强一致性模型,例如在第 6 章中讨论的无领导者或多领导者复制所提供的模型。这些方法通常不提供线性化,但对于不需要它的应用来说这就足够了。
Coordination Services 协调服务#
Consensus algorithms are useful in any distributed database that wants to offer linearizable operations, and many modern distributed databases use consensus algorithms for replication. But one family of systems is a particularly prominent user of consensus: coordination services such as ZooKeeper, etcd, or Consul. Although these systems look superficially like any other key-value store, they are not designed for general-purpose data storage like most databases.
共识算法在任何希望提供线性化操作的分布式数据库中都很有用,许多现代分布式数据库使用共识算法进行复制。但有一类系统是共识的特别突出使用者:协调服务,如 ZooKeeper、etcd 或 Consul。尽管这些系统表面上看起来像任何其他键值存储,但它们并非像大多数数据库那样设计用于通用数据存储。
Instead, they are designed to coordinate between nodes of another distributed system. For example, Kubernetes relies on etcd, while Spark and Flink in high availability mode rely on ZooKeeper running in the background. Coordination services are designed to hold small amounts of data that can fit entirely in memory (although they still write to disk for durability), which is replicated across multiple nodes using a fault-tolerant consensus algorithm.
相反,它们的设计目的是协调另一个分布式系统的节点。例如,Kubernetes 依赖于 etcd,而 Spark 和 Flink 在高可用模式下依赖于后台运行的 ZooKeeper。协调服务被设计用来存储少量可以完全适合内存中的数据(尽管它们仍然写入磁盘以确保持久性),这些数据使用容错共识算法在多个节点之间进行复制。
Coordination services are modeled after Google’s Chubby lock service [17,58]. They combine a consensus algorithm with several other features that turn out to be particularly useful when building distributed systems:
协调服务模仿了 Google 的 Chubby 锁服务 [17, 58]。它们结合了共识算法和其他几个在构建分布式系统时特别有用的特性:
Locks and leases 锁和租约
We saw previously how consensus systems can implement an atomic, fault-tolerant compare-and-set (CAS) operation. Coordination services rely on this approach to implement locks and leases: if several nodes concurrently try to acquire the same lease, only one of them will succeed.
我们之前看到了共识系统如何实现原子、容错的比较并设置(CAS)操作。协调服务依赖这种方法来实施锁和租约:如果多个节点同时尝试获取同一个租约,只有其中一个会成功。
Support for fencing 对围栏的支持
As discussed in “Distributed Locks and Leases”, when a resource is protected by a lease, you need fencing to prevent clients from interfering with each other in the case of a process pause or large network delay. Consensus systems can generate fencing tokens by giving each log entry a monotonically increasing ID (zxid and cversion in ZooKeeper, revision number in etcd).
正如 “分布式锁和租约” 中讨论的,当资源被租约保护时,你需要围栏来防止客户端在进程暂停或大网络延迟的情况下相互干扰。共识系统可以通过给每个日志条目分配单调递增的 ID 来生成围栏令牌(ZooKeeper 中的 zxid 和 cversion ,etcd 中的修订号)。
Failure detection 故障检测
Clients maintain a long-lived session on the coordination service, and periodically exchange heartbeats to check if the other node is still alive. Even if the connection is temporarily interrupted, or a server fails, any leases held by the client remain active. However, if there is no heartbeat for longer than the timeout of the lease, the coordination service assumes the client is dead and releases the lease (ZooKeeper calls these ephemeral nodes).
客户端在协调服务上保持一个长生命周期的会话,并定期交换心跳以检查另一个节点是否仍然存活。即使连接暂时中断,或服务器发生故障,客户端持有的任何租约仍然保持有效。然而,如果租约的超时时间超过了心跳的间隔,协调服务会认为客户端已死亡并释放租约(ZooKeeper 将这些节点称为临时节点)。
Change notifications 变更通知
A client can request that the coordination service sends it a notification whenever certain keys change. This allows a client to find out when another client joins the cluster (based on the value it writes to the coordination service), or if another client fails (because its session times out and its ephemeral nodes disappear), for example. These notifications save the client from having to frequently poll the service to find out about changes.
客户端可以请求协调服务在特定键发生变化时发送通知。这允许客户端知道另一个客户端加入集群(基于它写入协调服务的值),或另一个客户端发生故障(因为它的会话超时且临时节点消失),例如。这些通知避免了客户端频繁轮询服务以获取变更信息。
Failure detection and change notifications do not require consensus, but they are useful for distributed coordination alongside the atomic operations and fencing support that do require consensus.
故障检测和变更通知不需要达成共识,但它们对于分布式协调是有用的,而原子操作和需要达成共识的互斥支持则非常有用。
Allocating work to nodes 分配工作到节点#
A coordination service is useful if you have several instances of a process or service, and one of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should take over. This is necessary for single-leader databases, but it’s also appropriate for job schedulers and similar stateful systems.
如果您的系统有多个进程或服务的实例,并且其中一个需要被选为领导者或主节点,那么协调服务就很有用。如果领导者失败,其他节点中的一个应该接管。这对于单领导者数据库是必要的,但也适用于作业调度器和类似的有状态系统。
Another use case is when you have some sharded resource (database, message streams, file storage, distributed actor system, etc.) and need to decide which shard to assign to which node. As new nodes join the cluster, some of the shards need to be moved from existing nodes to the new nodes in order to rebalance the load. As nodes are removed or fail, other nodes need to take over the failed nodes’ work.
另一个用例是当您有一个分片的资源(数据库、消息流、文件存储、分布式演员系统等),并且需要决定将哪个分片分配给哪个节点。当新节点加入集群时,一些分片需要从现有节点移动到新节点,以重新平衡负载。当节点被移除或失败时,其他节点需要接管失败节点的任务。
These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and notifications in a coordination service. If done correctly, this approach allows the application to automatically recover from faults without human intervention. It’s not easy, despite the appearance of libraries such as Apache Curator that have sprung up to provide higher-level tools on top of the ZooKeeper client API—but it is still much better than attempting to implement the necessary consensus algorithms from scratch, which would be very prone to bugs.
这些任务可以通过在协调服务中合理使用原子操作、临时节点和通知来实现。如果操作得当,这种方法允许应用程序在无需人工干预的情况下自动从故障中恢复。尽管出现了像 Apache Curator 这样的库,它们在 ZooKeeper 客户端 API 之上提供了更高级的工具,但这并不容易 —— 但它仍然比从头开始实现必要的共识算法要好得多,后者非常容易出错。
A dedicated coordination service also has the advantage that it can run on a fixed set of nodes (usually three or five), regardless of how many nodes there are in the distributed system that relies on it for coordination. For example, in a storage system with thousands of shards, it would be terribly inefficient to run a consensus algorithm over thousands of nodes; it’s much better to “outsource” the consensus to a small number of nodes running a coordination service.
专门的协调服务还有一个优点,即它可以运行在固定的一组节点上(通常是三或五个),而不管依赖它进行协调的分布式系统中有多少节点。例如,在一个有数千个分片的存储系统中,在数千个节点上运行共识算法将非常低效;将共识 “外包” 给运行协调服务的一小部分节点要更好得多。
Normally, the kind of data managed by a coordination service is quite slow-changing: it represents information like “the node running on IP address 10.1.1.23 is the leader for shard 7,” and such assignments usually change on a timescale of minutes or hours. Coordination services are not intended for storing data that may change thousands of times per second. For that, it is better to use a conventional database; alternatively, tools like Apache BookKeeper [90,91] can be used to replicate fast-changing internal state of a service.
通常情况下,协调服务管理的数据变化非常缓慢:它代表的信息如 “运行在 IP 地址 10.1.1.23 的节点是分片 7 的领导者”,这类分配通常在分钟或小时的时间尺度上发生变化。协调服务并非用于存储每秒可能变化数千次的数据。对于这种情况,最好使用传统数据库;或者,可以使用 Apache BookKeeper [90, 91] 等工具来复制服务快速变化的内部状态。
Service discovery 服务发现#
ZooKeeper, etcd, and Consul are also often used for service discovery —that is, to find out which IP address you need to connect to in order to reach a particular service (see “Load balancers, service discovery, and service meshes”). In cloud environments, where it is common for virtual machines to continually come and go, you often don’t know the IP addresses of your services ahead of time. Instead, you can configure your services such that when they start up they register their network endpoints in a service registry, where they can then be found by other services.
ZooKeeper、etcd 和 Consul 也常用于服务发现 —— 即查找连接到特定服务的 IP 地址(参见 “负载均衡器、服务发现和服务网格”)。在云环境中,虚拟机经常不断出现和消失,你通常无法提前知道服务的 IP 地址。相反,你可以配置服务,使其在启动时将其网络端点注册到服务注册表中,然后其他服务可以在那里找到它们。
Using a coordination service for service discovery can be convenient, as its failure detection and change notification features make it easy for clients to keep track of service instances as they come and go. And if you are already using a coordination service for leases, locking, or leader election, it makes sense to also use it for service discovery, since it already knows which node should receive requests for your service.
使用协调服务进行服务发现可以很方便,因为其故障检测和变更通知功能使客户端能够轻松跟踪服务的实例随来随走。如果你已经使用协调服务进行租约、锁定或领导者选举,那么将其用于服务发现也是合理的,因为它已经知道哪个节点应该接收你的服务请求。
However, using consensus for service discovery is often overkill: this use case often doesn’t require linearizability, and it’s more important that service discovery is highly available and fast, since without it everything would grind to a halt. It’s therefore often preferable to cache service discovery information and accept that it might be slightly stale. For example, DNS-based service discovery uses multiple layers of caching to achieve good performance and availability.
然而,使用共识机制进行服务发现往往是大材小用:这种用例通常不需要线性化,更重要的是服务发现具有高度可用性和快速响应,因为如果没有它,一切都会陷入停滞。因此,通常更倾向于缓存服务发现信息,并接受其可能存在轻微的过时。例如,基于 DNS 的服务发现使用多层缓存来达到良好的性能和可用性。
To support this use case, ZooKeeper supports observers, which are replicas that receive the log and maintain a copy of the data stored in ZooKeeper, but which do not participate in the consensus algorithm’s voting process. Reads from an observer are not linearizable as they might be stale, but they remain available even if the network is interrupted, and they increase the read throughput that the system can support by caching.
为了支持这种用例,ZooKeeper 支持观察者,观察者是副本,它们接收日志并维护 ZooKeeper 中存储的数据副本,但它们不参与共识算法的投票过程。从观察者读取的数据可能存在过时,因此无法线性化,但即使网络中断,它们仍然可用,并且通过缓存,它们增加了系统可以支持的数据读取吞吐量。
Summary 摘要#
In this chapter we examined the topic of strong consistency in fault-tolerant systems: what it is, and how to achieve it. We looked in depth at linearizability, a popular formalization of strong consistency: it means that replicated data appears as though there were only a single copy, and all operations act on it atomically. We saw that linearizability is useful when you need some data to be up-to-date when you read it, or if you need to resolve a race condition (e.g. if multiple nodes are concurrently trying to do the same thing, such as creating files with the same name).
在本章中,我们探讨了容错系统中强一致性的主题:它是什么,以及如何实现它。我们深入研究了线性化,这是强一致性的一个流行形式化:这意味着复制数据看起来只有一份副本,所有操作都对其原子地执行。我们看到,当您需要读取时数据保持最新,或者需要解决竞争条件(例如,如果多个节点同时尝试做同一件事,如创建同名文件)时,线性化是有用的。
Although linearizability is appealing because it is easy to understand—it makes a database behave like a variable in a single-threaded program—it has the downside of being slow, especially in environments with large network delays. Many replication algorithms don’t guarantee linearizability, even though it superficially might seem like they might provide strong consistency.
尽管线性化因其易于理解而吸引人 —— 它使数据库表现得像单线程程序中的一个变量 —— 但它有一个缺点,那就是速度慢,特别是在具有较大网络延迟的环境中。许多复制算法并不保证线性化,即使它们表面上似乎可以提供强一致性。
Next, we applied the concept of linearizability in the context of ID generators. A single-node auto-incrementing counter is linearizable, but not fault-tolerant. Many distributed ID generation schemes don’t guarantee that the IDs are ordered consistently with the order in which the events actually happened. Logical clocks such as Lamport clocks and hybrid logical clocks provide ordering that is consistent with causality, but no linearizability.
接下来,我们将线性化概念应用于 ID 生成器的上下文中。单节点自动递增计数器是线性化的,但不可靠。许多分布式 ID 生成方案并不能保证 ID 的顺序与事件实际发生的顺序一致。逻辑时钟如 Lamport 时钟和混合逻辑时钟提供了与因果关系一致的排序,但没有线性化特性。
This led us to the concept of consensus. We saw that achieving consensus means deciding something in such a way that all nodes agree on what was decided, and such that they can’t change their mind. A wide range of problems are actually reducible to consensus and are equivalent to each other (i.e., if you have a solution for one of them, you can transform it into a solution for all of the others). Such equivalent problems include:
这引出了共识的概念。我们看到,实现共识意味着以一种方式做出决定,使得所有节点都同意决定的内容,并且他们不能改变主意。许多问题实际上可以简化为共识问题,并且它们彼此等价(即,如果你为其中一个问题有解决方案,你可以将其转化为其他所有问题的解决方案)。这类等价问题包括:
Linearizable compare-and-set operation
线性化比较并设置操作
The register needs to atomically decide whether to set its value, based on whether its current value equals the parameter given in the operation.
寄存器需要根据其当前值是否等于操作中给定的参数,原子性地决定是否设置其值。
Locks and leases 锁和租约
When several clients are concurrently trying to grab a lock or lease, the lock decides which one successfully acquired it.
当多个客户端同时尝试获取锁或租约时,锁会决定哪一个成功获取。
Uniqueness constraints 唯一性约束
When several transactions concurrently try to create conflicting records with the same key, the constraint must decide which one to allow and which should fail with a constraint violation.
当多个事务同时尝试创建具有相同键的冲突记录时,约束必须决定允许哪一个,哪一个应该因约束违规而失败。
Shared logs 共享日志
When several nodes concurrently want to append entries to a log, the log decides in which order they are appended. Total order broadcast is also equivalent.
当多个节点同时想要向日志中追加条目时,日志会决定它们的追加顺序。总排序广播也是等效的。
Atomic transaction commit
原子事务提交
The database nodes involved in a distributed transaction must all decide the same way whether to commit or abort the transaction.
参与分布式事务的数据库节点必须以相同的方式决定是提交还是中止事务。
Linearizable fetch-and-add operation
可线性化的获取并加操作
This operation can be used to implement an ID generator. Several nodes can concurrently invoke the operation, and it decides the order in which they increment the counter. This case actually solves consensus only between two nodes, while the others work for any number of nodes.
此操作可用于实现 ID 生成器。多个节点可以并发调用该操作,并决定它们按何种顺序递增计数器。这种情况实际上只解决了两个节点之间的共识,而其他情况适用于任意数量的节点。
All of these are straightforward if you only have a single node, or if you are willing to assign the decision-making capability to a single node. This is what happens in a single-leader database: all the power to make decisions is vested in the leader, which is why such databases are able to provide linearizable operations, uniqueness constraints, a replication log, and more.
如果你只有一个节点,或者愿意将决策能力分配给单个节点,所有这些都很直接。这就是单领导者数据库发生的情况:所有决策权都集中在领导者手中,这就是为什么这类数据库能够提供线性化操作、唯一性约束、复制日志等。
However, if that single leader fails, or if a network interruption makes the leader unreachable, such a system becomes unable to make any progress until a human performs a manual failover. Widely-used consensus algorithms like Raft and Paxos are essentially single-leader replication with built-in automatic leader election and failover if the current leader fails.
然而,如果单个领导者失败,或者网络中断导致领导者无法访问,这样的系统将无法继续进行,直到人类执行手动故障转移。像 Raft 和 Paxos 这样广泛使用的共识算法本质上就是单领导者复制,内置了自动领导者选举和故障转移机制,以防当前领导者失败。
Consensus algorithms are carefully designed to ensure that no committed writes are lost during a failover, and that the system cannot get into a split brain state in which multiple nodes are accepting writes. This requires that every write, and every linearizable read, is confirmed by a quorum (typically a majority) of nodes. This can be expensive, especially across geographic regions, but it is unavoidable if you want the strong consistency and fault tolerance that consensus provides.
共识算法经过精心设计,以确保在故障转移期间不会丢失已提交的写入,并且系统不会进入分裂脑状态,即多个节点同时接受写入。这要求每次写入和每次可线性化读取都必须得到多数节点(即法定人数)的确认。这可能很昂贵,尤其是在跨地域的情况下,但如果你想要共识算法提供的强一致性和容错性,这是不可避免的。
Coordination services like ZooKeeper and etcd are also built on top of consensus algorithms. They provide locks, leases, failure detection, and change notification features that are useful for managing the state of distributed applications. If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, it is advisable to use a coordination service. It won’t guarantee that you will get it right, but it will probably help.
像 ZooKeeper 和 etcd 这样的协调服务也是基于共识算法构建的。它们提供锁、租约、故障检测和变更通知等功能,这些功能对于管理分布式应用的状态非常有用。如果你发现自己需要做某件可以简化为共识的事情,并且希望它具有容错性,建议使用协调服务。它不能保证你一定能做对,但可能会有所帮助。
Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory that has been developed since the 1980s. This theory makes it possible to build systems that can tolerate all the faults that we discussed in Chapter 9, and still ensure that your data is not corrupted. This is an amazing achievement, and the references at the end of this chapter feature some of the highlights of this work.
共识算法复杂而微妙,但它们得到了自 20 世纪 80 年代以来发展起来的丰富理论支持。这一理论使得构建能够容忍我们在第 9 章讨论的所有故障的系统成为可能,同时还能确保您的数据不会损坏。这是一项惊人的成就,本章末尾的参考文献列出了这项工作的部分亮点。
Nevertheless, consensus is not always the right tool: in some systems, the strong consistency properties it provides are not needed, and it is better to have weaker consistency with higher availability and better performance. In these cases, it is common to use leaderless or multi-leader replication, which we previously discussed in Chapter 6. The logical clocks that we discussed in this chapter are helpful in that context.
然而,共识并非总是合适的工具:在某些系统中,它提供的强一致性特性并非必需,而拥有更弱的强一致性、更高的可用性和更好的性能会更好。在这些情况下,通常使用无领导者或多领导者复制,我们之前在第 6 章讨论过这些内容。本章讨论的逻辑时钟在这种情况下很有帮助。
Footnotes 脚注#
References 参考文献#
[1] Maurice P. Herlihy and Jeannette M. Wing.Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems (TOPLAS), volume 12, issue 3, pages 463–492, July 1990.doi.1145/78969.78972
[1] Maurice P. Herlihy 和 Jeannette M. Wing. 可线性化性:并发对象的一个正确性条件。ACM 计算机程序设计语言与系统杂志(TOPLAS),第 12 卷,第 3 期,第 463–492 页,1990 年 7 月。doi.1145/78969.78972
[2] Leslie Lamport.On interprocess communication. Distributed Computing, volume 1, issue 2, pages 77–101, June 1986. doi.1007/BF01786228
[2] Leslie Lamport. 关于进程间通信。分布式计算,第 1 卷,第 2 期,第 77–101 页,1986 年 6 月。doi.1007/BF01786228
[3] David K. Gifford.Information Storage in a Decentralized Computer System. Xerox Palo Alto Research Centers, CSL-81-8, June 1981. Archived at perma.cc/2XXP-3JPB
[3] David K. Gifford. 分布式计算机系统中的信息存储。Xerox 桃园研究中心,CSL-81-8,1981 年 6 月。存档于 perma.cc / 2XXP - 3JPB
[4] Martin Kleppmann.Please Stop Calling Databases CP or AP. martin.kleppmann.com, May 2015. Archived at perma.cc/MJ5G-75GL
[4] Martin Kleppmann. 请停止将数据库称为 CP 或 AP。martin.kleppmann.com,2015 年 5 月。存档于 perma.cc / MJ5G - 75GL
[5] Kyle Kingsbury.Call Me Maybe: MongoDB Stale Reads. aphyr.com, April 2015. Archived at perma.cc/DXB4-J4JC
[5] Kyle Kingsbury. 叫我也许:MongoDB 陈旧读取. aphyr.com, 2015 年 4 月. 归档于 perma.cc/DXB4-J4JC
[6] Kyle Kingsbury.Computational Techniques in Knossos. aphyr.com, May 2014. Archived at perma.cc/2X5M-EHTU
[6] Kyle Kingsbury. Knossos 中的计算技术. aphyr.com, 2014 年 5 月. 归档于 perma.cc / 2X5M - EHTU
[7] Kyle Kingsbury and Peter Alvaro.Elle: Inferring Isolation Anomalies from Experimental Observations. Proceedings of the VLDB Endowment, volume 14, issue 3, pages 268–280, November 2020.doi.14778/3430915.3430918
[7] Kyle Kingsbury 和 Peter Alvaro. Elle:从实验观察中推断隔离异常. VLDB 基金会会议录, 第 14 卷, 第 3 期, 第 268-280 页, 2020 年 11 月. doi.14778/3430915.3430918
[8] Paolo Viotti and Marko Vukolić.Consistency in Non-Transactional Distributed Storage Systems. ACM Computing Surveys (CSUR), volume 49, issue 1, article no. 19, June 2016.doi.1145/2926965
[8] Paolo Viotti 和 Marko Vukolić. 非事务性分布式存储系统中的数据一致性. ACM 计算综述 (CSUR), 第 49 卷, 第 1 期, 第 19 篇, 2016 年 6 月. doi.1145/2926965
[9] Peter Bailis.Linearizability Versus Serializability. bailis.org, September 2014. Archived at perma.cc/386B-KAC3
[9] Peter Bailis. 线性化与可串行化. bailis.org, 2014 年 9 月. 永久存档于 perma.cc / 386B - KAC3
[10] Daniel Abadi.Correctness Anomalies Under Serializable Isolation. dbmsmusings.blogspot.com, June 2019. Archived at perma.cc/JGS7-BZFY
[10] Daniel Abadi. 可串行化隔离下的正确性异常. dbmsmusings.blogspot.com, 2019 年 6 月. 永久存档于 perma.cc/JGS7-BZFY
[11] Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica.Highly Available Transactions: Virtues and Limitations. Proceedings of the VLDB Endowment, volume 7, issue 3, pages 181–192, November 2013. doi.14778/2732232.2732237, extended version published as arXiv.0309
[11] Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, 和 Ion Stoica. 高可用性事务:优点与局限. VLDB 基金会会议录, 第 7 卷, 第 3 期, 第 181-192 页, 2013 年 11 月. doi.14778 / 2732232.2732237, 扩展版本发表于 arXiv.0309
[12] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman.Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at microsoft.com.
[12] Philip A. Bernstein, Vassos Hadzilacos, 和 Nathan Goodman. 数据库系统的并发控制与恢复. Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, 可在线获取于 microsoft.com.
[13] Andrei Matei.CockroachDB’s consistency model.cockroachlabs.com, February 2021. Archived at perma.cc/MR38-883B
[13] Andrei Matei. CockroachDB 的一致性模型. cockroachlabs.com, 2021 年 2 月. 永久存档于 perma.cc/MR38-883B
[14] Murat Demirbas.Strict-serializability, but at what cost, for what purpose?muratbuffalo.blogspot.com, August 2022. Archived at perma.cc/T8AY-N3U9
[14] Murat Demirbas. 严格串行化,但代价是什么,目的是什么?muratbuffalo.blogspot.com, 2022 年 8 月. 永久存档于 perma.cc / T8AY - N3U9
[15] Ben Darnell.How to talk about consistency and isolation in distributed DBs. cockroachlabs.com, February 2022. Archived at perma.cc/53SV-JBGK
[15] Ben Darnell. 如何在分布式数据库中谈论一致性和隔离性. cockroachlabs.com, 2022 年 2 月. 永久存档于 perma.cc / 53SV - JBGK
[16] Daniel Abadi.An explanation of the difference between Isolation levels vs. Consistency levels.dbmsmusings.blogspot.com, August 2019. Archived at perma.cc/QSF2-CD4P
[16] Daniel Abadi. 解释隔离级别与一致性级别之间的区别. dbmsmusings.blogspot.com, 2019 年 8 月. 永久存档于 perma.cc/QSF2-CD4P
[17] Mike Burrows.The Chubby Lock Service for Loosely-Coupled Distributed Systems. At 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
[17] Mike Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. 在第 7 届 USENIX 操作系统设计与实现研讨会(OSDI)上,2006 年 11 月。
[18] Flavio P. Junqueira and Benjamin Reed.ZooKeeper: Distributed Process Coordination. O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[19] Murali Vallath.Oracle 10g RAC Grid, Services & Clustering. Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7
[20] Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica.Coordination Avoidance in Database Systems.Proceedings of the VLDB Endowment, volume 8, issue 3, pages 185–196, November 2014.doi.14778/2735508.2735509
[20] Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. Coordination Avoidance in Database Systems. VLDB 基金会会议录,第 8 卷,第 3 期,第 185-196 页,2014 年 11 月。doi.14778/2735508.2735509
[21] Kyle Kingsbury.Call Me Maybe: etcd and Consul. aphyr.com, June 2014. Archived at perma.cc/XL7U-378K
[21] Kyle Kingsbury. Call Me Maybe: etcd 和 Consul. aphyr.com, 2014 年 6 月. 永久存档于 perma.cc / XL7U - 378K
[22] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. Zab: High-Performance Broadcast for Primary-Backup Systems. At 41st IEEE International Conference on Dependable Systems and Networks (DSN), June 2011.doi.1109/DSN.2011.5958223
[22] Flavio P. Junqueira, Benjamin C. Reed, 和 Marco Serafini. Zab: 高性能广播协议用于主备系统. 在第 41 届 IEEE 可靠系统与网络国际会议(DSN)上, 2011 年 6 月. doi.1109/DSN.2011.5958223
[23] Diego Ongaro and John K. Ousterhout.In Search of an Understandable Consensus Algorithm. At USENIX Annual Technical Conference (ATC), June 2014.
[23] Diego Ongaro 和 John K. Ousterhout. 寻找一个易于理解的共识算法. 在 USENIX 年度技术会议(ATC)上, 2014 年 6 月.
[24] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev.Sharing Memory Robustly in Message-Passing Systems. Journal of the ACM, volume 42, issue 1, pages 124–142, January 1995.doi.1145/200836.200869
[24] Hagit Attiya, Amotz Bar-Noy, 和 Danny Dolev. 在消息传递系统中可靠地共享内存. ACM 期刊, 第 42 卷, 第 1 期, 第 124-142 页, 1995 年 1 月. doi.1145/200836.200869
[25] Nancy Lynch and Alex Shvartsman.Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts. At 27th Annual International Symposium on Fault-Tolerant Computing (FTCS), June 1997.doi.1109/FTCS.1997.614100
[25] Nancy Lynch 和 Alex Shvartsman. 基于动态仲裁确认广播的共享内存鲁棒模拟. 在第 27 届年度容错计算国际研讨会 (FTCS), 1997 年 6 月. doi.1109/FTCS.1997.614100
[26] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues.Introduction to Reliable and Secure Distributed Programming, 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7,doi.1007/978-3-642-15260-3
[26] Christian Cachin, Rachid Guerraoui, 和 Luís Rodrigues. 可靠和安全分布式编程导论,第二版. Springer, 2011. ISBN: 978-3-642-15259-7, doi.1007/978-3-642-15260-3
[27] Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis.Possible Issue with Read Repair? Email thread on cassandra-dev mailing list, October 2012.
[27] Niklas Ekström, Mikhail Panchenko, 和 Jonathan Ellis. 读取修复可能存在的问题?cassandra-dev 邮件列表上的邮件讨论串,2012 年 10 月.
[28] Maurice P. Herlihy.Wait-Free Synchronization.ACM Transactions on Programming Languages and Systems (TOPLAS), volume 13, issue 1, pages 124–149, January 1991.doi.1145/114005.102808
[28] Maurice P. Herlihy. 无等待同步. ACM 计算机编程语言与系统杂志 (TOPLAS), 第 13 卷, 第 1 期, 第 124–149 页, 1991 年 1 月. doi.1145/114005.102808
[29] Armando Fox and Eric A. Brewer.Harvest, Yield, and Scalable Tolerant Systems. At 7th Workshop on Hot Topics in Operating Systems (HotOS), March 1999. doi.1109/HOTOS.1999.798396
[29] Armando Fox 和 Eric A. Brewer. Harvest, Yield, 和 可扩展容错系统. 在第 7 届操作系统热点研讨会(HotOS)上,1999 年 3 月. doi.1109/HOTOS.1999.798396
[30] Seth Gilbert and Nancy Lynch.Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.ACM SIGACT News, volume 33, issue 2, pages 51–59, June 2002.doi.1145/564585.564601
[30] Seth Gilbert 和 Nancy Lynch. Brewer 猜想和一致、可用、分区容错型 Web 服务的可行性. ACM SIGACT News, 第 33 卷, 第 2 期, 第 51-59 页, 2002 年 6 月. doi.1145/564585.564601
[31] Seth Gilbert and Nancy Lynch.Perspectives on the CAP Theorem. IEEE Computer Magazine, volume 45, issue 2, pages 30–36, February 2012.doi.1109/MC.2011.389
[31] Seth Gilbert 和 Nancy Lynch. 关于 CAP 定理的视角. IEEE 计算机杂志, 第 45 卷, 第 2 期, 第 30-36 页, 2012 年 2 月. doi.1109/MC.2011.389
[32] Eric A. Brewer.CAP Twelve Years Later: How the ‘Rules’ Have Changed. IEEE Computer Magazine, volume 45, issue 2, pages 23–29, February 2012. doi.1109/MC.2012.37
[32] Eric A. Brewer. CAP 十二年后:规则如何改变. IEEE 计算机杂志, 第 45 卷, 第 2 期, 第 23-29 页, 2012 年 2 月. doi.1109/MC.2012.37
[33] Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen.Consistency in Partitioned Networks. ACM Computing Surveys, volume 17, issue 3, pages 341–370, September 1985.doi.1145/5505.5508
[33] 苏珊・B・戴维森、赫克托・加西亚 - 莫利纳和戴尔・斯金。分区网络中的数据一致性。ACM 计算机调查,第 17 卷,第 3 期,第 341–370 页,1985 年 9 月。doi.1145/5505.5508
[34] Paul R. Johnson and Robert H. Thomas.RFC 677: The Maintenance of Duplicate Databases. Network Working Group, January 1975.
[34] 保罗・R・约翰逊和罗伯特・H・托马斯。RFC 677:重复数据库的维护。网络工作组,1975 年 1 月。
[35] Michael J. Fischer and Alan Michael.Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network. At 1st ACM Symposium on Principles of Database Systems (PODS), March 1982.doi.1145/588111.588124
[35] 迈克尔・J・费舍尔和艾伦・迈克尔。为了在不可靠网络中获得高数据可用性而牺牲串行化。在第一届 ACM 数据库系统原理研讨会(PODS)上,1982 年 3 月。doi.1145/588111.588124
[36] Eric A. Brewer.NoSQL: Past, Present, Future. At QCon San Francisco, November 2012.
[36] 埃里克・A・布鲁尔。NoSQL:过去、现在与未来。在 2012 年旧金山 QCon 上。
[37] Adrian Cockcroft.Migrating to Microservices. At QCon London, March 2014.
[37] Adrian Cockcroft. 迁移到微服务。在 2014 年 3 月的 QCon 伦敦。
[38] Martin Kleppmann.A Critique of the CAP Theorem. arXiv.05393, September 2015.
[38] Martin Kleppmann. 对 CAP 定理的批判. arXiv.05393, 2015 年 9 月.
[39] Daniel Abadi.Problems with CAP, and Yahoo’s little known NoSQL system. dbmsmusings.blogspot.com, April 2010. Archived at perma.cc/4NTZ-CLM9
[39] Daniel Abadi. CAP 的问题,以及雅虎鲜为人知的 NoSQL 系统. dbmsmusings.blogspot.com, 2010 年 4 月. 永久存档于 perma.cc / 4NTZ - CLM9
[40] Daniel Abadi.Hazelcast and the Mythical PA/EC System. dbmsmusings.blogspot.com, October 2017. Archived at perma.cc/J5XM-U5C2
[40] Daniel Abadi. Hazelcast 与神话般的 PA/EC 系统. dbmsmusings.blogspot.com, 2017 年 10 月. 永久存档于 perma.cc/J5XM - U5C2
[41] Eric Brewer.Spanner, TrueTime & The CAP Theorem. research.google.com, February 2017. Archived at perma.cc/59UW-RH7N
[41] Eric Brewer. Spanner, TrueTime & The CAP Theorem. research.google.com, 2017 年 2 月. 永久存档于 perma.cc / 59UW - RH7N
[42] Daniel J. Abadi.Consistency Tradeoffs in Modern Distributed Database System Design. IEEE Computer Magazine, volume 45, issue 2, pages 37–42, February 2012.doi.1109/MC.2012.33
[42] Daniel J. Abadi. Modern Distributed Database System Design 中的一致性权衡. IEEE Computer Magazine, 第 45 卷, 第 2 期, 第 37-42 页, 2012 年 2 月. doi.1109/MC.2012.33
[43] Nancy A. Lynch.A Hundred Impossibility Proofs for Distributed Computing. At 8th ACM Symposium on Principles of Distributed Computing (PODC), August 1989.doi.1145/72981.72982
[43] Nancy A. Lynch. 分布式计算的百种不可能证明. 在第 8 届 ACM 分布式计算原理研讨会 (PODC), 1989 年 8 月. doi.1145/72981.72982
[44] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin.Consistency, Availability, and Convergence. University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011. Archived at perma.cc/SAV8-9JAJ
[44] Prince Mahajan, Lorenzo Alvisi, 和 Mike Dahlin. 一致性、可用性和收敛性. 德克萨斯大学奥斯汀分校计算机科学系, 技术报告 UTCS TR-11-22, 2011 年 5 月. 永久存档于 perma.cc/SAV8-9JAJ
[45] Hagit Attiya, Faith Ellen, and Adam Morrison.Limitations of Highly-Available Eventually-Consistent Data Stores. At ACM Symposium on Principles of Distributed Computing (PODC), July 2015.doi.1145/2767386.2767419
[45] Hagit Attiya、Faith Ellen 和 Adam Morrison。高可用最终一致性数据存储的局限性。在 ACM 分布式计算原理研讨会(PODC),2015 年 7 月。doi.1145/2767386.2767419
[46] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen.x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors. Communications of the ACM, volume 53, issue 7, pages 89–97, July 2010.doi.1145/1785414.1785443
[46] Peter Sewell、Susmit Sarkar、Scott Owens、Francesco Zappa Nardelli 和 Magnus O. Myreen。x86-TSO:一种严格且实用的 x86 多处理器程序员模型。ACM 通讯,第 53 卷,第 7 期,第 89-97 页,2010 年 7 月。doi.1145/1785414.1785443
[47] Martin Thompson.Memory Barriers/Fences. mechanical-sympathy.blogspot.co.uk, July 2011. Archived at perma.cc/7NXM-GC5U
[47] Martin Thompson。内存屏障/栅栏。mechanical-sympathy.blogspot.co.uk,2011 年 7 月。存档于 perma.cc/7NXM - GC5U
[48] Ulrich Drepper.What Every Programmer Should Know About Memory. akkadia.org, November 2007. Archived at perma.cc/NU6Q-DRXZ
[48] Ulrich Drepper. 每个程序员都应该知道的关于内存的知识。akkadia.org, 2007 年 11 月。存档于 perma.cc / NU6Q - DRXZ
[49] Hagit Attiya and Jennifer L. Welch.Sequential Consistency Versus Linearizability. ACM Transactions on Computer Systems (TOCS), volume 12, issue 2, pages 91–122, May 1994.doi.1145/176575.176576
[49] Hagit Attiya 和 Jennifer L. Welch。顺序一致性与线性化。ACM 计算机系统通讯(TOCS),第 12 卷,第 2 期,第 91-122 页,1994 年 5 月。doi.1145/176575.176576
[50] Kyzer R. Davis, Brad G. Peabody, and Paul J. Leach.Universally Unique IDentifiers (UUIDs). RFC 9562, IETF, May 2024.
[50] Kyzer R. Davis, Brad G. Peabody, 和 Paul J. Leach. 通用唯一标识符 (UUIDs)。RFC 9562, IETF, 2024 年 5 月。
[51] Ryan King.Announcing Snowflake.blog.x.com, June 2010. Archived at archive.org
[51] Ryan King. 宣布 Snowflake。blog.x.com, 2010 年 6 月。存档于 archive.org
[52] Alizain Feerasta.Universally Unique Lexicographically Sortable Identifier.github.com, 2016. Archived at perma.cc/NV2Y-ZP8U
[52] Alizain Feerasta. 通用唯一可字典排序标识符。github.com, 2016 年。存档于 perma.cc / NV2Y - ZP8U
[53] Rob Conery.A Better ID Generator for PostgreSQL. bigmachine.io, May 2014. Archived at perma.cc/K7QV-3KFC
[53] Rob Conery. 《为 PostgreSQL 设计更好的 ID 生成器》. bigmachine.io, 2014 年 5 月. 归档于 perma.cc / K7QV - 3KFC
[54] Leslie Lamport.Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, volume 21, issue 7, pages 558–565, July 1978.doi.1145/359545.359563
[54] Leslie Lamport. 《分布式系统中的时间、时钟和事件排序》. ACM 通讯, 第 21 卷, 第 7 期, 第 558-565 页, 1978 年 7 月. doi.1145/359545.359563
[55] Sandeep S. Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone.Logical Physical Clocks.18th International Conference on Principles of Distributed Systems (OPODIS), December 2014.doi.1007/978-3-319-14472-6_2
[55] Sandeep S. Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, 和 Marcelo Leone. 《逻辑物理时钟》. 第 18 届分布式系统原理国际会议 (OPODIS), 2014 年 12 月. doi.1007/978-3-319-14472-6_2
[56] Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, and Luís Rodrigues.On the use of Clocks to Enforce Consistency in the Cloud. IEEE Data Engineering Bulletin, volume 38, issue 1, pages 18–31, March 2015. Archived at perma.cc/68ZU-45SH
[56] Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, 和 Luís Rodrigues. 《使用时钟在云中维护一致性》. IEEE 数据工程简报, 第 38 卷, 第 1 期, 第 18-31 页, 2015 年 3 月. 归档于 perma.cc / 68ZU - 45SH
[57] Daniel Peng and Frank Dabek.Large-Scale Incremental Processing Using Distributed Transactions and Notifications. At 9th USENIX Conference on Operating Systems Design and Implementation (OSDI), October 2010.
[57] Daniel Peng 和 Frank Dabek. 使用分布式事务和通知进行大规模增量处理. 在第 9 届 USENIX 操作系统设计与实现会议(OSDI),2010 年 10 月.
[58] Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone. Paxos Made Live – An Engineering Perspective. At 26th ACM Symposium on Principles of Distributed Computing (PODC), June 2007.doi.1145/1281100.1281103
[58] Tushar Deepak Chandra, Robert Griesemer 和 Joshua Redstone. Paxos 实践 —— 工程视角. 在第 26 届 ACM 分布式计算原理会议(PODC),2007 年 6 月. doi.1145/1281100.1281103
[59] Will Portnoy.Lessons Learned from Implementing Paxos. blog.willportnoy.com, June 2012. Archived at perma.cc/QHD9-FDD2
[59] Will Portnoy. 实现 Paxos 的教训. blog.willportnoy.com, 2012 年 6 月. 永久存档于 perma.cc/QHD9-FDD2
[60] Brian M. Oki and Barbara H. Liskov.Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. At 7th ACM Symposium on Principles of Distributed Computing (PODC), August 1988.doi.1145/62546.62549
[60] Brian M. Oki 和 Barbara H. Liskov. 视图标记复制:一种支持高可用分布式系统的新主副本方法. 在第 7 届 ACM 分布式计算原理会议(PODC),1988 年 8 月. doi.1145/62546.62549
[61] Barbara H. Liskov and James Cowling.Viewstamped Replication Revisited. Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012. Archived at perma.cc/56SJ-WENQ
[61] Barbara H. Liskov 和 James Cowling. Viewstamped Replication Revisited. 麻省理工学院技术报告 MIT - CSAIL-TR-2012-021, 2012 年 7 月. 存档于 perma.cc / 56SJ - WENQ
[62] Leslie Lamport.The Part-Time Parliament. ACM Transactions on Computer Systems, volume 16, issue 2, pages 133–169, May 1998.doi.1145/279227.279229
[62] Leslie Lamport. The Part-Time Parliament. ACM 计算机系统学报, 第 16 卷, 第 2 期, 第 133–169 页, 1998 年 5 月. doi.1145/279227.279229
[63] Leslie Lamport.Paxos Made Simple. ACM SIGACT News, volume 32, issue 4, pages 51–58, December 2001. Archived at perma.cc/82HP-MNKE
[63] Leslie Lamport. Paxos Made Simple. ACM SIGACT 新闻, 第 32 卷, 第 4 期, 第 51–58 页, 2001 年 12 月. 存档于 perma.cc / 82HP - MNKE
[64] Robbert van Renesse and Deniz Altinbuken.Paxos Made Moderately Complex. ACM Computing Surveys (CSUR), volume 47, issue 3, article no. 42, February 2015. doi.1145/2673577
[64] Robbert van Renesse 和 Deniz Altinbuken. Paxos Made Moderately Complex. ACM 计算机调查报告 (CSUR), 第 47 卷, 第 3 期, 第 42 篇, 2015 年 2 月. doi.1145/2673577
[65] Diego Ongaro.Consensus: Bridging Theory and Practice. PhD Thesis, Stanford University, August 2014. Archived at perma.cc/5VTZ-2ADH
[65] Diego Ongaro. 共识:连接理论与实践. 博士论文, 斯坦福大学, 2014 年 8 月. 存档于 perma.cc / 5VTZ - 2ADH
[66] Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft.Raft Refloated: Do We Have Consensus?ACM SIGOPS Operating Systems Review, volume 49, issue 1, pages 12–21, January 2015.doi.1145/2723872.2723876
[66] Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, 和 Jon Crowcroft. Raft 重启:我们达成共识了吗?ACM SIGOPS 操作系统评论, 第 49 卷, 第 1 期, 第 12-21 页, 2015 年 1 月. doi.1145/2723872.2723876
[67] André Medeiros.ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice. Aalto University School of Science, March 2012. Archived at perma.cc/FVL4-JMVA
[67] André Medeiros. ZooKeeper 的原子广播协议:理论与实践. 阿尔托大学科学学院, 2012 年 3 月. 存档于 perma.cc/FVL4-JMVA
[68] Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider. Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab. IEEE Transactions on Dependable and Secure Computing, volume 12, issue 4, pages 472–484, September 2014.doi.1109/TDSC.2014.2355848
[68] Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider. 活力无限:Paxos 与视图标记复制与 Zab 的比较. IEEE 可靠与安全计算汇刊, 第 12 卷, 第 4 期, 第 472-484 页, 2014 年 9 月. doi.1109/TDSC.2014.2355848
[69] Heidi Howard and Richard Mortier.Paxos vs Raft: Have we reached consensus on distributed consensus?. At 7th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC), April 2020.doi.1145/3380787.3393681
[69] Heidi Howard 和 Richard Mortier. Paxos 与 Raft:我们是否就分布式一致性达成了共识?. 在第 7 届分布式数据一致性原理与实践研讨会 (PaPoC), 2020 年 4 月. doi.1145/3380787.3393681
[70] Miguel Castro and Barbara H. Liskov.Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Transactions on Computer Systems, volume 20, issue 4, pages 396–461, November 2002.doi.1145/571637.571640
[70] Miguel Castro 和 Barbara H. Liskov. 实用的拜占庭容错与主动恢复. ACM 计算机系统汇刊, 第 20 卷, 第 4 期, 第 396-461 页, 2002 年 11 月. doi.1145/571637.571640
[71] Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis.SoK: Consensus in the Age of Blockchains. At 1st ACM Conference on Advances in Financial Technologies (AFT), October 2019.doi.1145/3318041.3355458
[71] Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, 和 George Danezis. SoK: 区块链时代的共识. 在第 1 届 ACM 金融技术进展会议 (AFT), 2019 年 10 月. doi.1145/3318041.3355458
[72] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson.Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, volume 32, issue 2, pages 374–382, April 1985.doi.1145/3149.214121
[72] Michael J. Fischer, Nancy Lynch, 和 Michael S. Paterson. 带有一个故障进程的分布式共识的不可能性. ACM 学报, 卷 32, 期 2, 页 374–382, 1985 年 4 月. doi.1145/3149.214121
[73] Tushar Deepak Chandra and Sam Toueg.Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, volume 43, issue 2, pages 225–267, March 1996.doi.1145/226643.226647
[73] Tushar Deepak Chandra 和 Sam Toueg. 可靠分布式系统的不可靠故障检测器. ACM 学报, 卷 43, 期 2, 页 225–267, 1996 年 3 月. doi.1145/226643.226647
[74] Michael Ben-Or.Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols. At 2nd ACM Symposium on Principles of Distributed Computing (PODC), August 1983.doi.1145/800221.806707
[74] Michael Ben-Or. 自由选择的优势:完全异步协议. 在第 2 届 ACM 分布式计算原理研讨会 (PODC), 1983 年 8 月. doi.1145/800221.806707
[75] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer.Consensus in the Presence of Partial Synchrony. Journal of the ACM, volume 35, issue 2, pages 288–323, April 1988.doi.1145/42282.42283
[75] Cynthia Dwork, Nancy Lynch, 和 Larry Stockmeyer. 部分同步下的共识. ACM 计算机学会杂志, 卷 35, 第 2 期, 页 288–323, 1988 年 4 月. doi.1145/42282.42283
[76] Xavier Défago, André Schiper, and Péter Urbán.Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey. ACM Computing Surveys, volume 36, issue 4, pages 372–421, December 2004.doi.1145/1041680.1041682
[76] Xavier Défago, André Schiper, 和 Péter Urbán. 全序广播和多播算法:分类和综述. ACM 计算机调查, 卷 36, 第 4 期, 页 372–421, 2004 年 12 月. doi.1145/1041680.1041682
[77] Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations and Advanced Topics, 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6,doi.1002/0471478210
[77] Hagit Attiya 和 Jennifer Welch. 分布式计算:基础、模拟和高级主题,第二版. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, doi.1002/0471478210
[78] Rachid Guerraoui.Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus. At 9th International Workshop on Distributed Algorithms (WDAG), September 1995.doi.1007/BFb0022140
[78] Rachid Guerraoui. 重访非阻塞原子提交与共识的关系. 在第 9 届国际分布式算法研讨会 (WDAG), 1995 年 9 月. doi.1007/BFb0022140
[79] Jim N. Gray and Leslie Lamport.Consensus on Transaction Commit. ACM Transactions on Database Systems (TODS), volume 31, issue 1, pages 133–160, March 2006. doi.1145/1132863.1132867
[79] Jim N. Gray 和 Leslie Lamport. 事务提交的共识. ACM 数据库系统学报(TODS),第 31 卷,第 1 期,第 133–160 页,2006 年 3 月. doi.1145/1132863.1132867
[80] Fred B. Schneider.Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, volume 22, issue 4, pages 299–319, December 1990.doi.1145/98163.98167
[80] Fred B. Schneider. 使用状态机方法实现容错服务:教程. ACM 计算机综述,第 22 卷,第 4 期,第 299–319 页,1990 年 12 月. doi.1145/98163.98167
[81] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi.Calvin: Fast Distributed Transactions for Partitioned Database Systems. At ACM International Conference on Management of Data (SIGMOD), May 2012.doi.1145/2213836.2213838
[81] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, 和 Daniel J. Abadi. Calvin:分区数据库系统的快速分布式事务. 在 ACM 国际数据管理会议(SIGMOD),2012 年 5 月. doi.1145/2213836.2213838
[82] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck.Tango: Distributed Data Structures over a Shared Log. At 24th ACM Symposium on Operating Systems Principles (SOSP), November 2013.doi.1145/2517349.2522732
[82] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, 和 Aviad Zuck. Tango: 在共享日志上的分布式数据结构. 在第 24 届 ACM 操作系统原理研讨会 (SOSP), 2013 年 11 月. doi.1145/2517349.2522732
[83] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis.CORFU: A Shared Log Design for Flash Clusters. At 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2012.
[83] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, 和 John D. Davis. CORFU:闪存集群的共享日志设计. 在第 9 届 USENIX 网络系统设计与实现会议(NSDI),2012 年 4 月.
[84] Vasilis Gavrielatos, Antonios Katsarakis, and Vijay Nagarajan.Odyssey: the impact of modern hardware on strongly-consistent replication protocols. At 16th European Conference on Computer Systems (EuroSys), April 2021.doi.1145/3447786.3456240
[84] Vasilis Gavrielatos, Antonios Katsarakis, 和 Vijay Nagarajan. Odyssey: 现代硬件对强一致性复制协议的影响. 在第 16 届欧洲计算机系统会议 (EuroSys), 2021 年 4 月. doi.1145/3447786.3456240
[85] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman.Flexible Paxos: Quorum Intersection Revisited. At 20th International Conference on Principles of Distributed Systems (OPODIS), December 2016.doi.4230/LIPIcs.OPODIS.2016.25
[85] Heidi Howard, Dahlia Malkhi, 和 Alexander Spiegelman. 弹性 Paxos: 决策交集再探. 在第 20 届分布式系统原理国际会议 (OPODIS), 2016 年 12 月. doi.4230/LIPIcs.OPODIS.2016.25
[86] Martin Kleppmann.Distributed Systems lecture notes. University of Cambridge, October 2024. Archived at perma.cc/SS3Q-FNS5
[86] Martin Kleppmann. 分布式系统讲义笔记. 剑桥大学, 2024 年 10 月. 保存在 perma.cc / SS3Q - FNS5
[87] Kyle Kingsbury.Call Me Maybe: Elasticsearch 1.5.0. aphyr.com, April 2015. Archived at perma.cc/37MZ-JT7H
[87] Kyle Kingsbury. Call Me Maybe: Elasticsearch 1.5.0. aphyr.com, April 2015. 归档于 perma.cc / 37MZ - JT7H
[88] Heidi Howard and Jon Crowcroft.Coracle: Evaluating Consensus at the Internet Edge. At Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), August 2015.doi.1145/2829988.2790010
[88] Heidi Howard 和 Jon Crowcroft. Coracle: 评估互联网边缘的共识机制. 在 ACM 数据通信特别兴趣小组 (SIGCOMM) 年会, 2015 年 8 月. doi.1145/2829988.2790010
[89] Tom Lianza and Chris Snook.A Byzantine failure in the real world. blog.cloudflare.com, November 2020. Archived at perma.cc/83EZ-ALCY
[89] Tom Lianza 和 Chris Snook. 真实世界中的拜占庭式故障. blog.cloudflare.com, 2020 年 11 月. 归档于 perma.cc / 83EZ - ALCY
[90] Ivan Kelly.BookKeeper Tutorial.github.com, October 2014. Archived at perma.cc/37Y6-VZWU
[90] Ivan Kelly. BookKeeper 教程. github.com, 2014 年 10 月. 归档于 perma.cc / 37Y6 - VZWU
[91] Jack Vanlightly.Apache BookKeeper Insights Part 1 — External Consensus and Dynamic Membership. medium.com, November 2021. Archived at perma.cc/3MDB-8GFB