9. The Trouble with Distributed Systems

53.2K
0
0
最后修改于

Chapter 9. The Trouble with Distributed Systems 第九章 分布式系统的麻烦#

They’re funny things, Accidents. You never have them till you’re having them.
事故真是奇怪的东西,你从未遇到,直到你遇到了。

A.A. Milne, The House at Pooh Corner (1928)
A.A. 米尔恩,《小熊维尼的家》(1928 年)

As discussed in “Reliability and Fault Tolerance”, making a system reliable means ensuring that the system as a whole continues working, even when things go wrong (i.e., when there is a fault). However, anticipating all the possible faults and handling them is not that easy. As a developer, it is very tempting to focus mostly on the happy path (after all, most of the time things work fine!) and to neglect faults, since they introduce a lot of edge cases.
如 “可靠性和容错性” 所述,使系统可靠意味着确保整个系统在出现问题时(即存在故障时)继续运行。然而,预见所有可能的故障并处理它们并不那么容易。作为开发者,专注于快乐路径(毕竟,大多数时候事情都运行良好!)并忽视故障非常有诱惑力,因为它们引入了许多边缘情况。

If you want your system to be reliable in the presence of faults you have to radically change your mindset, and focus on the things that could go wrong, even though they may be unlikely. It doesn’t matter whether there is only a one-in-a-million chance of a thing going wrong: in a large enough system, one-in-a-million events happen every day. Experienced systems operators will tell you that anything that can go wrong will go wrong.
如果你希望你的系统能在出现故障时保持可靠,你必须从根本上改变你的思维方式,关注那些可能出错的事情,即使它们可能不太可能发生。无论出错的概率只有百万分之一:在一个足够大的系统中,百万分之一的事件每天都在发生。有经验的系统操作人员会告诉你,任何可能出错的事情最终都会出错。

Moreover, working with distributed systems is fundamentally different from writing software on a single computer—and the main difference is that there are lots of new and exciting ways for things to go wrong [1,2]. In this chapter, you will get a taste of the problems that arise in practice, and an understanding of the things you can and cannot rely on.
此外,与在单台计算机上编写软件相比,处理分布式系统从根本上来说是不同的 —— 主要区别在于出现了许多新的、令人兴奋的出错方式 [1, 2]。在本章中,你将初步了解实践中出现的问题,并理解你可以和不可以依赖的事情。

To understand what challenges we are up against, we will now turn our pessimism to the maximum and explore the things that may go wrong in a distributed system. We will look into problems with networks (“Unreliable Networks”) as well as clocks and timing issues (“Unreliable Clocks”). The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a distributed system and how to reason about things that have happened (“Knowledge, Truth, and Lies”). Later, in Chapter 10, we will look at some examples of how we can achieve fault tolerance in the face of those faults.
为了了解我们面临的挑战,我们将把悲观情绪推向极致,探讨分布式系统中可能出错的事情。我们将研究网络问题("不可靠的网络")以及时钟和时间问题("不可靠的时钟")。所有这些问题带来的后果令人困惑,因此我们将探讨如何思考分布式系统的状态以及如何推理已经发生的事情("知识、真相与谎言")。稍后,在第 10 章中,我们将探讨如何在面对这些故障时实现容错性。

Faults and Partial Failures 故障和部分失效#

When you are writing a program on a single computer, it normally behaves in a fairly predictable way: either it works or it doesn’t. Buggy software may give the appearance that the computer is sometimes “having a bad day” (a problem that is often fixed by a reboot), but that is mostly just a consequence of badly written software.
当你在一台单机上编写程序时,它通常表现得相当可预测:要么它工作,要么它不工作。有缺陷的软件可能会给人一种计算机有时 "心情不好" 的印象(这个问题通常通过重启来解决),但这大多是编写糟糕的软件的后果。

There is no fundamental reason why software on a single computer should be flaky: when the hardware is working correctly, the same operation always produces the same result (it is deterministic). If there is a hardware problem (e.g., memory corruption or a loose connector), the consequence is usually a total system failure (e.g., kernel panic, “blue screen of death,” failure to start up). An individual computer with good software is usually either fully functional or entirely broken, but not something in between.
单台计算机上的软件出现不稳定现象并没有根本原因:当硬件正常工作时,相同的操作总是产生相同的结果(它是确定性的)。如果出现硬件问题(例如内存损坏或松动的连接器),后果通常是系统完全失效(例如内核恐慌、蓝屏死机、无法启动)。一台拥有良好软件的计算机通常要么完全正常工作,要么完全损坏,但不会处于两者之间。

This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a computer to crash completely rather than returning a wrong result, because wrong results are difficult and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are implemented and present an idealized system model that operates with mathematical perfection. A CPU instruction always does the same thing; if you write some data to memory or disk, that data remains intact and doesn’t get randomly corrupted. As discussed in “Hardware and Software Faults”, this is not actually true—in reality, data does get silently corrupted and CPUs do sometimes silently return the wrong result—but it happens rarely enough that we can get away with ignoring it.
这是计算机设计中的一个有意选择:如果发生内部故障,我们宁愿计算机完全崩溃,也不愿返回错误的结果,因为错误的结果难以处理且令人困惑。因此,计算机隐藏了它们所依赖的模糊物理现实,并呈现一个理想化的系统模型,该模型以数学上的完美性运行。CPU 指令总是执行相同的事情;如果你将某些数据写入内存或磁盘,这些数据将保持完整,不会随机损坏。正如在 “硬件和软件故障” 中讨论的那样,这实际上并不完全正确 —— 在现实中,数据确实会无声地损坏,CPU 有时确实会无声地返回错误结果 —— 但这些情况发生的频率足够低,以至于我们可以忽略它们。

When you are writing software that runs on several computers, connected by a network, the situation is fundamentally different. In distributed systems, faults occur much more frequently, and so we can no longer ignore them—we have no choice but to confront the messy reality of the physical world. And in the physical world, a remarkably wide range of things can go wrong, as illustrated by this anecdote [3]:
当你编写在多台计算机上运行的软件,这些计算机通过网络连接时,情况就根本不同了。在分布式系统中,故障发生的频率要高得多,因此我们不能再忽视它们 —— 我们别无选择,只能直面物理世界的混乱现实。在物理世界中,有大量的事情可能出错,这一点可以通过这个轶事 [3] 来说明。

In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even an ops guy.
在我的有限经验中,我曾处理过单个数据中心(DC)中的长期网络分区、PDU [电源分配单元] 故障、交换机故障、整个机架意外断电、整个数据中心的骨干网络故障、整个数据中心的电源故障,以及一名低血糖司机驾驶他的福特皮卡撞向数据中心的 HVAC [供暖、通风和空调] 系统。而我甚至不是运维人员。

Coda Hale 科达・黑尔

In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure. The difficulty is that partial failures are nondeterministic: if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail. As we shall see, you may not even know whether something succeeded or not!
在分布式系统中,即使其他部分正常运行,也可能存在某些部分以不可预测的方式损坏。这种现象被称为部分故障。困难在于部分故障是非确定性的:如果你尝试执行涉及多个节点和网络的任何操作,它有时可能成功,有时会不可预测地失败。正如我们将看到的,你可能甚至不知道某个操作是否成功!

This nondeterminism and possibility of partial failures is what makes distributed systems hard to work with [4]. On the other hand, if a distributed system can tolerate partial failures, that opens up powerful possibilities: for example, it allows you to perform a rolling upgrade, rebooting one node at a time to install software updates while the system as a whole continues working uninterrupted all the time. Fault tolerance therefore allows us to make distributed systems more reliable than single-node systems: we can build a reliable system from unreliable components.
这种非确定性和部分故障的可能性使得分布式系统难以处理 [4]。另一方面,如果分布式系统能够容忍部分故障,那将开辟出强大的可能性:例如,它允许你进行滚动升级,逐个重启节点来安装软件更新,同时整个系统持续不间断地运行。因此,容错性使我们能够使分布式系统比单节点系统更可靠:我们可以用不可靠的组件构建一个可靠的系统。

But before we can implement fault tolerance, we need to know more about the faults that we’re supposed to tolerate. It is important to consider a wide range of possible faults—even fairly unlikely ones—and to artificially create such situations in your testing environment to see what happens. In distributed systems, suspicion, pessimism, and paranoia pay off.
但在我们能够实现容错之前,我们需要更多地了解我们预期要容忍的故障。考虑各种可能的故障 —— 即使是相当不太可能发生的故障 —— 并人为地在测试环境中创造这些情况以观察会发生什么,这是很重要的。在分布式系统中,怀疑、悲观和偏执是有益的。

Unreliable Networks 不可靠的网络#

As discussed in “Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”, the distributed systems we focus on in this book are mostly shared-nothing systems: i.e., a bunch of machines connected by a network. The network is the only way those machines can communicate—we assume that each machine has its own memory and disk, and one machine cannot access another machine’s memory or disk (except by making requests to a service over the network). Even when storage is shared, such as with Amazon’s S3, machines communicate with shared storage services over the network.
正如在 “共享内存、共享磁盘和共享无架构” 中讨论的,本书关注的分布式系统大多是共享无系统:即由网络连接的一组机器。网络是这些机器之间唯一的通信方式 —— 我们假设每台机器都有自己的内存和磁盘,一台机器不能访问另一台机器的内存或磁盘(除了通过网络请求服务)。即使存储是共享的,比如亚马逊的 S3,机器也是通过网络与共享存储服务进行通信。

The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks. In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send a request and expect a response, many things could go wrong (some of which are illustrated in Figure 9-1):
互联网和数据中心中的大多数内部网络(通常是以太网)都是异步分组网络。在这种网络中,一个节点可以向另一个节点发送消息(一个分组),但网络不保证它何时到达,或者是否能够到达。如果你发送一个请求并期望得到响应,可能会发生许多问题(其中一些在图 9-1 中进行了说明):

  1. Your request may have been lost (perhaps someone unplugged a network cable).
    您的请求可能已丢失(也许有人拔掉了网络线缆)。
  2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the recipient is overloaded).
    您的请求可能正在队列中等待,稍后会送达(也许网络或接收方过载)。
  3. The remote node may have failed (perhaps it crashed or it was powered down).
    远程节点可能已失效(也许它崩溃了或被关闭了电源)。
  4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause; see “Process Pauses”), but it will start responding again later.
    远程节点可能暂时停止响应(也许它正在经历长时间的垃圾回收暂停;参见 “进程暂停”),但稍后会再次开始响应。
  5. The remote node may have processed your request, but the response has been lost on the network (perhaps a network switch has been misconfigured).
    远程节点可能已经处理了您的请求,但响应在网络中丢失了(也许网络交换机配置错误)。
  6. The remote node may have processed your request, but the response has been delayed and will be delivered later (perhaps the network or your own machine is overloaded).
    远程节点可能已经处理了您的请求,但响应延迟了,稍后会送达(也许网络或您的机器过载了)。
    ddia 0901

The sender can’t even tell whether the packet was delivered: the only option is for the recipient to send a response message, which may in turn be lost or delayed. These issues are indistinguishable in an asynchronous network: the only information you have is that you haven’t received a response yet. If you send a request to another node and don’t receive a response, it is impossible to tell why.
发送者甚至无法判断数据包是否已送达:唯一的选择是接收者发送响应消息,而该消息也可能丢失或延迟。在异步网络中,这些问题是无法区分的:你唯一获得的信息是你尚未收到响应。如果你向另一个节点发送请求而没有收到响应,你无法判断原因。

The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don’t know whether the remote node got your request or not (and if the request is still queued somewhere, it may still be delivered to the recipient, even if the sender has given up on it).
处理此问题的常用方法是超时:经过一段时间后你放弃等待,并假设响应不会到来。然而,当超时发生时,你仍然不知道远程节点是否收到了你的请求(如果请求仍在某个地方排队,它可能仍然会被发送给接收者,即使发送者已经放弃)。

The Limitations of TCPTCP 的局限性#

Network packets have a maximum size (generally a few kilobytes), but many applications need to send messages (requests, responses) that are too big to fit in one packet. These applications most often use TCP, the Transmission Control Protocol, to establish a connection that breaks up large data streams into individual packets, and puts them back together again on the receiving side.
网络数据包有最大尺寸(通常为几 KB),但许多应用程序需要发送的消息(请求、响应)太大而无法放入一个数据包中。这些应用程序通常使用传输控制协议(TCP)来建立连接,将大数据流拆分成单个数据包,并在接收端将它们重新组合在一起。

Note 注意#

Most of what we say about TCP applies also to its more recent alternative QUIC, as well as the Stream Control Transmission Protocol (SCTP) used in WebRTC, the BitTorrent uTP protocol, and other transport protocols. For a comparison to UDP, see “TCP Versus UDP”.
我们所说的关于 TCP 的内容也适用于其更近期的替代品 QUIC,以及 WebRTC 中使用的流控制传输协议(SCTP)、BitTorrent uTP 协议和其他传输协议。有关与 UDP 的比较,请参阅 “TCP 与 UDP”。

TCP is often described as providing “reliable” delivery, in the sense that it detects and retransmits dropped packets, it detects reordered packets and puts them back in the correct order, and it detects packet corruption using a simple checksum. It also figures out how fast it can send data so that it is transferred as quickly as possible, but without overloading the network or the receiving node; this is known as congestion control, flow control, or backpressure [5].
TCP 通常被描述为提供 “可靠” 传输,这意味着它能检测并重传丢失的数据包,检测乱序数据包并按正确顺序重排,以及使用简单的校验和检测数据包损坏。它还能确定发送数据的速度,以便尽可能快地传输数据,但不会使网络或接收节点过载;这被称为拥塞控制、流量控制或反向压力 [5]。

When you “send” some data by writing it to a socket, it actually doesn’t get sent immediately, but it’s only placed in a buffer managed by your operating system. When the congestion control algorithm decides that it has capacity to send a packet, it takes the next packet-worth of data from that buffer and passes it to the network interface. The packet passes through several switches and routers, and eventually the receiving node’s operating system places the packet’s data in a receive buffer and sends an acknowledgment packet back to the sender. Only then does the receiving operating system notify the application that some more data has arrived [6].
当你通过写入套接字来 “发送” 一些数据时,它实际上并不会立即发送,而是被放置在你的操作系统管理的缓冲区中。当拥塞控制算法决定有发送数据包的容量时,它会从该缓冲区中取出下一个数据包的数据量,并将其传递给网络接口。数据包会通过多个交换机和路由器,最终接收节点的操作系统将数据包的数据放入接收缓冲区,并向发送方发送确认包。只有到那时,接收方的操作系统才会通知应用程序有更多数据到达 [6]。

So, if TCP provides “reliability”, does that mean we no longer need to worry about networks being unreliable? Unfortunately not. It decides that a packet must have been lost if no acknowledgment arrives within some timeout, but TCP can’t tell either whether it was the outbound packet or the acknowledgment that was lost. Although TCP can resend the packet, it can’t guarantee that the new packet will get through either. If the network cable is unplugged, TCP can’t plug it back in for you. Eventually, after a configurable timeout, TCP gives up and signals an error to the application.
所以,如果 TCP 提供了 “可靠性”,是否意味着我们不再需要担心网络不可靠的问题?不幸的是并非如此。它会判断如果在一定超时时间内没有收到确认,数据包就丢失了,但 TCP 也无法判断是出站数据包还是确认包丢失了。尽管 TCP 可以重发数据包,但它也无法保证新的数据包一定能成功传输。如果网络线被拔掉,TCP 也无法帮你重新插上。最终,在可配置的超时时间后,TCP 会放弃并通知应用程序发生错误。

If a TCP connection is closed with an error—perhaps because the remote node crashed, or perhaps because the network was interrupted—you unfortunately have no way of knowing how much data was actually processed by the remote node [6]. Even if TCP acknowledged that a packet was delivered, this only means that the operating system kernel on the remote node received it, but the application may have crashed before it handled that data. If you want to be sure that a request was successful, you need a positive response from the application itself [7].
如果 TCP 连接因错误关闭 —— 可能是远程节点崩溃了,也可能是网络中断了 —— 你很不幸无法知道远程节点实际处理了多少数据 [6]。即使 TCP 确认数据包已送达,这也仅仅意味着远程节点的操作系统内核收到了它,但应用程序可能在处理这些数据之前就崩溃了。如果你要确保请求成功,你需要应用程序本身的积极响应 [7]。

Nevertheless, TCP is very useful, because it provides a convenient way of sending and receiving messages that are too big to fit in one packet. Once a TCP connection is established, you can also use it to send multiple requests and responses. This is usually done by first sending a header that indicates the length of the following message in bytes, followed by the actual message. HTTP and many RPC protocols (see “Dataflow Through Services: REST and RPC”) work like this.
然而,TCP 非常实用,因为它提供了一种方便的方式,用于发送和接收无法容纳在一个数据包中的消息。一旦建立了 TCP 连接,你也可以用它来发送多个请求和响应。这通常是通过首先发送一个标头,该标头指示后续消息的字节数长度,然后是实际消息来完成的。HTTP 和许多 RPC 协议(参见 “服务中的数据流:REST 和 RPC”)就是如此工作的。

Network Faults in Practice 实际中的网络故障#

We have been building computer networks for decades—one might hope that by now we would have figured out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a datacenter operated by one company [8]:
几十年来,我们一直在构建计算机网络 —— 人们或许希望到如今我们已经能够解决如何让它们变得可靠的问题。不幸的是,我们尚未成功。一些系统性的研究以及大量的轶事证据表明,网络问题可能出人意料地常见,即使在像由一家公司运营的数据中心这样受控的环境中也是如此 [8]:

  • One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack [9].
    一项对中等规模数据中心的研究发现,每月大约有 12 个网络故障,其中一半导致单个机器断开连接,另一半导致整个机架断开连接 [9]。
  • Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers [10]. It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.
    另一项研究测量了如机架顶部交换机、聚合交换机和负载均衡器等组件的故障率 [10]。研究发现,增加冗余网络设备并没有像人们希望的那样减少故障,因为它无法防范人为错误(例如配置错误的交换机),而人为错误是导致中断的主要原因。
  • Interruptions of wide-area fiber links have been blamed on cows [11], beavers [12], and sharks [13] (though shark bites have become rarer due to better shielding of submarine cables [14]). Humans are also at fault, be it due to accidental misconfiguration [15], scavenging [16], or sabotage [17].
    广域光纤链路的故障被归咎于奶牛 [11]、河狸 [12] 和鲨鱼 [13](尽管由于海底电缆的更好屏蔽,鲨鱼咬伤已变得较少见 [14])。人类也有责任,无论是由于意外配置错误 [15]、拾荒 [16] 还是蓄意破坏 [17]。
  • Across different cloud regions, round-trip times of up to several minutes have been observed at high percentiles [18, Table 3].Even within a single datacenter, packet delay of more than a minute can occur during a network topology reconfiguration, triggered by a problem during a software upgrade for a switch [19]. Thus, we have to assume that messages might be delayed arbitrarily.
    在不同的云区域,在高百分位数上观察到的往返时间长达几分钟 [18, 表 3]。即使在单个数据中心内,在网络拓扑重新配置期间也可能发生超过一分钟的包延迟,这由交换机软件升级期间的问题触发 [19]。因此,我们必须假设消息可能会被任意延迟。
  • Sometimes communications are partially interrupted, depending on who you’re talking to: for example, A and B can communicate, B and C can communicate, but A and C cannot [20,21].Other surprising faults include a network interface that sometimes drops all inbound packets but sends outbound packets successfully [22]: just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.
    有时通信会部分中断,这取决于你与谁交谈:例如,A 和 B 可以通信,B 和 C 可以通信,但 A 和 C 不能 [20, 21]。其他令人惊讶的故障包括一个网络接口有时会丢弃所有传入数据包但成功发送传出数据包 [22]:仅仅因为一个网络链路在一个方向上工作,并不能保证它在相反方向上也工作。
  • Even a brief network interruption can have repercussions that last for much longer than the original issue [8,20,23].
    即使是短暂的网络中断,其影响也可能比原始问题持续更长时间 [8, 20, 23]。

Network partitions 网络分区#

When one part of the network is cut off from the rest due to a network fault, that is sometimes called a network partition or netsplit, but it is not fundamentally different from other kinds of network interruption. Network partitions are not related to sharding of a storage system, which is sometimes also called partitioning (see Chapter 7).
当网络的一部分因网络故障而与其他部分断开连接时,这种情况有时被称为网络分区或网络分裂,但它与其他类型的网络中断在本质上没有区别。网络分区与存储系统的分片无关,分片有时也被称为分区(见第 7 章)。

Even if network faults are rare in your environment, the fact that faults can occur means that your software needs to be able to handle them. Whenever any communication happens over a network, it may fail—there is no way around it.
即使您环境中网络故障很少发生,但故障确实可能发生的事实意味着您的软件需要能够处理它们。每当通过网络进行通信时,它都可能失败 —— 这是无法避免的。

If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests, even when the network recovers [24], or it could even delete all of your data [25]. If software is put in an unanticipated situation, it may do arbitrary unexpected things.
如果网络故障的错误处理没有定义和测试,可能会发生任意糟糕的事情:例如,即使网络恢复,集群也可能陷入死锁并永久无法处理请求 [24],甚至可能删除所有您的数据 [25]。如果软件被置于未预料到的情况中,它可能会做任意意想不到的事情。

Handling network faults doesn’t necessarily mean tolerating them: if your network is normally fairly reliable, a valid approach may be to simply show an error message to users while your network is experiencing problems. However, you do need to know how your software reacts to network problems and ensure that the system can recover from them.It may make sense to deliberately trigger network problems and test the system’s response (this is known as fault injection; see “Fault injection”).
处理网络故障并不一定意味着容忍它们:如果你的网络通常相当可靠,一个有效的方法可能是当网络出现问题时,向用户显示错误消息。然而,你需要知道你的软件如何应对网络问题,并确保系统能够从中恢复。故意触发网络问题并测试系统的响应是有意义的(这被称为故障注入;参见 “故障注入”)。

Detecting Faults 检测故障#

Many systems need to automatically detect faulty nodes. For example:
许多系统需要自动检测故障节点。例如:

  • A load balancer needs to stop sending requests to a node that is dead (i.e., take it out of rotation).
    负载均衡器需要停止向一个已死(即从轮询中移除)的节点发送请求。
  • In a distributed database with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader (see “Handling Node Outages”).
    在单主复制分布式数据库中,如果主节点失败,需要将其中一个从节点提升为新的主节点(参见 “处理节点故障”)。

Unfortunately, the uncertainty about the network makes it difficult to tell whether a node is working or not. In some specific circumstances you might get some feedback to explicitly tell you that something is not working:
不幸的是,网络的不确定性使得难以判断一个节点是否正常工作。在某些特定情况下,你可能会得到一些反馈来明确告诉你某个东西出了问题:

  • If you can reach the machine on which the node should be running, but no process is listening on the destination port (e.g., because the process crashed), the operating system will helpfully close or refuse TCP connections by sending a RST or FIN packet in reply.
    如果你可以连接到节点应该运行的机器,但目标端口上没有进程监听(例如因为进程崩溃了),操作系统会友好地关闭或拒绝 TCP 连接,通过发送 RSTFIN 数据包来响应。
  • If a node process crashed (or was killed by an administrator) but the node’s operating system is still running, a script can notify other nodes about the crash so that another node can take over quickly without having to wait for a timeout to expire. For example, HBase does this [26].
    如果一个节点进程崩溃(或被管理员终止),但该节点的操作系统仍在运行,脚本可以通知其他节点关于崩溃的情况,以便另一个节点能够快速接管,而无需等待超时过期。例如,HBase 就采用了这种方式 [26]。
  • If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level (e.g., if the remote machine is powered down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared datacenter with no access to the switches themselves, or if you can’t reach the management interface due to a network problem.
    如果你可以访问数据中心网络交换机的管理界面,你可以查询它们来检测硬件级别的链路故障(例如,如果远程机器已关闭电源)。如果你通过互联网连接,或者你位于一个无法访问交换机的共享数据中心,或者由于网络问题无法访问管理界面,那么这个选项将不可行。
  • If a router is sure that the IP address you’re trying to connect to is unreachable, it may reply to you with an ICMP Destination Unreachable packet. However, the router doesn’t have a magic failure detection capability either—it is subject to the same limitations as other participants of the network.
    如果一个路由器确定你试图连接的 IP 地址无法到达,它可能会向你发送一个 ICMP 目标不可达数据包。然而,路由器也没有神奇故障检测能力 —— 它同样受到网络其他参与者的相同限制。

Rapid feedback about a remote node being down is useful, but you can’t count on it. If something has gone wrong, you may get an error response at some level of the stack, but in general you have to assume that you will get no response at all. You can retry a few times, wait for a timeout to elapse, and eventually declare the node dead if you don’t hear back within the timeout.
快速反馈远程节点宕机的情况很有用,但你不能依赖它。如果出了问题,你可能在堆栈的某个层级收到错误响应,但通常你必须假设你将完全得不到响应。你可以重试几次,等待超时时间过去,如果在超时时间内没有收到回复,最终将节点判定为死亡。

Timeouts and Unbounded Delays 超时和无限延迟#

If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There is unfortunately no simple answer.
如果超时是唯一能确定检测故障的方法,那么超时应该设置多长?不幸的是,没有简单的答案。

A long timeout means a long wait until a node is declared dead (and during this time, users may have to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due to a load spike on the node or the network).
长超时意味着节点被判定为死亡需要等待很长时间(在这段时间内,用户可能需要等待或看到错误消息)。短超时能更快地检测到故障,但存在更高的风险,即在节点实际上只是暂时减速(例如由于节点负载激增或网络问题)时错误地判定节点为死亡。

Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of performing some action (for example, sending an email), and another node takes over, the action may end up being performed twice. We will discuss this issue in more detail in “Knowledge, Truth, and Lies”, and in Chapters 10 and 12.
过早地声明一个节点死亡是有问题的:如果该节点实际上还活着并且正在执行某个操作(例如,发送电子邮件),而另一个节点接手,那么该操作可能会被执行两次。我们将在 “知识、真相与谎言” 以及第 10 章和第 12 章中更详细地讨论这个问题。

When a node is declared dead, its responsibilities need to be transferred to other nodes, which places additional load on other nodes and the network. If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse. In particular, it could happen that the node actually wasn’t dead but only slow to respond due to overload; transferring its load to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other dead, and everything stops working—see “When an overloaded system won’t recover”).
当一个节点被声明死亡时,它的职责需要转移到其他节点,这会给其他节点和网络带来额外的负载。如果系统已经因高负载而难以应对,过早地声明节点死亡可能会使问题变得更糟。特别是,该节点实际上可能并没有死亡,只是由于过载而响应缓慢;将其负载转移到其他节点可能会导致级联故障(在极端情况下,所有节点都声明彼此死亡,并且一切停止工作—— 参见 “当过载系统无法恢复时”)。

Imagine a fictitious system with a network that guaranteed a maximum delay for packets—every packet is either delivered within some time d, or it is lost, but delivery never takes longer than d. Furthermore, assume that you can guarantee that a non-failed node always handles a request within some time r. In this case, you could guarantee that every successful request receives a response within time 2 d  +  r —and if you don’t receive a response within that time, you know that either the network or the remote node is not working. If this was true, 2 d  +  r would be a reasonable timeout to use.
想象一个虚构的系统,其网络保证数据包的最大延迟 —— 每个数据包要么在某个时间 d 内送达,要么丢失,但送达时间永远不会超过 d。此外,假设你可以保证一个非故障节点总能在某个时间 r 内处理请求。在这种情况下,你可以保证每个成功的请求都能在时间 2d + r 内收到响应 —— 如果你在这个时间内没有收到响应,就说明网络或远程节点出了问题。如果这是真的,2d + r 将是一个合理的超时时间。

Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time it may take for a packet to arrive), and most server implementations cannot guarantee that they can handle requests within some maximum time (see “Response time guarantees”). For failure detection, it’s not sufficient for the system to be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip times to throw the system off-balance.
不幸的是,我们大多数系统都不具备上述任何一种保证:异步网络具有无界延迟(也就是说,它们尽可能快地传送数据包,但数据包到达的时间没有上限),而且大多数服务器实现无法保证在最大时间内处理请求(参见 “响应时间保证”)。对于故障检测,系统大部分时间快是不够的:如果你的超时时间很低,短暂的全程时间波动就足以让系统失衡。

Network congestion and queueing 网络拥塞和排队#

When driving a car, travel times on road networks often vary most due to traffic congestion. Similarly, the variability of packet delays on computer networks is most often due to queueing [27]:
开车时,由于交通拥堵,道路网络上的出行时间往往变化最大。类似地,计算机网络上数据包延迟的变化通常也是由于排队 [27]:

  • If several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one (as illustrated in Figure 9-2). On a busy network link, a packet may have to wait a while until it can get a slot (this is called network congestion). If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent—even though the network is functioning fine.
    如果多个不同的节点同时尝试将数据包发送到同一个目的地,网络交换机必须将它们排队,并逐个将它们输入到目标网络链路(如图 9-2 所示)。在繁忙的网络链路上,一个数据包可能必须等待一段时间才能获得一个时隙(这被称为网络拥塞)。如果传入的数据量太大,导致交换机队列已满,数据包就会被丢弃,因此需要重新发送 —— 即使网络运行正常。
  • When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is queued by the operating system until the application is ready to handle it. Depending on the load on the machine, this may take an arbitrary length of time [28].
    当数据包到达目标计算机时,如果所有 CPU 核心目前都忙,操作系统会将来自网络的传入请求排队,直到应用程序准备好处理它。根据机器的负载,这可能需要任意长的时间 [28]。
  • In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered) by the virtual machine monitor [29], further increasing the variability of network delays.
    在虚拟化环境中,当另一个虚拟机使用 CPU 核心时,正在运行的操作系统通常会被暂停几十毫秒。在这段时间内,虚拟机无法从网络中获取任何数据,因此虚拟机监视器 [29] 会将传入数据排队(缓冲),这进一步增加了网络延迟的变异性。
  • As mentioned earlier, in order to avoid overloading the network, TCP limits the rate at which it sends data. This means additional queueing at the sender before the data even enters the network.
    如前所述,为了防止网络过载,TCP 限制了其发送数据的速率。这意味着数据在进入网络之前,发送端需要额外的排队。
    ddia 0902

Moreover, when TCP detects and automatically retransmits a lost packet, although the application does not see the packet loss directly, it does see the resulting delay (waiting for the timeout to expire, and then waiting for the retransmitted packet to be acknowledged).
此外,当 TCP 检测到并自动重传丢失的数据包时,尽管应用程序直接看不到数据包丢失,但它确实会看到由此产生的延迟(等待超时过期,然后等待重传的数据包被确认)。

All of these factors contribute to the variability of network delays. Queueing delays have an especially wide range when a system is close to its maximum capacity: a system with plenty of spare capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very quickly.
所有这些因素都导致了网络延迟的变异性。当系统接近其最大容量时,排队延迟的范围尤其宽:一个拥有大量备用容量的系统可以轻易地清空队列,而在高度利用的系统里,长队列可以非常快地形成。

In public clouds and multitenant datacenters, resources are shared among many customers: the network links and switches, and even each machine’s network interface and CPUs (when running on virtual machines), are shared. Processing large amounts of data can use the entire capacity of network links (saturate them). As you have no control over or insight into other customers’ usage of the shared resources, network delays can be highly variable if someone near you (a noisy neighbor) is using a lot of resources [30,31].
在公共云和多租户数据中心中,资源是在许多客户之间共享的:网络链路和交换机,甚至每台机器的网络接口和 CPU(在虚拟机中运行时)都是共享的。处理大量数据可能会使用整个网络链路的容量(使其饱和)。由于你无法控制或了解其他客户对共享资源的使用情况,如果附近有某个客户(一个吵闹的邻居)使用了大量资源,网络延迟可能会非常高 [30, 31]。

In such environments, you can only choose timeouts experimentally: measure the distribution of network round-trip times over an extended period, and over many machines, to determine the expected variability of delays. Then, taking into account your application’s characteristics, you can determine an appropriate trade-off between failure detection delay and risk of premature timeouts.
在这样的环境中,你只能通过实验来选择超时时间:测量网络往返时间的分布情况,并在多台机器上进行,以确定延迟的预期变化范围。然后,根据你的应用特性,你可以确定在故障检测延迟和过早超时风险之间的适当权衡。

Even better, rather than using configured constant timeouts, systems can continually measure response times and their variability (jitter), and automatically adjust timeouts according to the observed response time distribution. The Phi Accrual failure detector [32], which is used for example in Akka and Cassandra [33] is one way of doing this. TCP retransmission timeouts also work similarly [5].
更好的做法是,系统可以持续测量响应时间和其变化(抖动),并根据观察到的响应时间分布自动调整超时时间。Phi Accrual 故障检测器 [32],例如在 Akka 和 Cassandra [33] 中使用,是解决这个问题的一种方法。TCP 重传超时也类似地工作 [5]。

Synchronous Versus Asynchronous Networks 同步网络与异步网络#

Distributed systems would be a lot simpler if we could rely on the network to deliver packets with some fixed maximum delay, and not to drop packets. Why can’t we solve this at the hardware level and make the network reliable so that the software doesn’t need to worry about it?
如果我们可以依赖网络以固定的最大延迟交付数据包,并且不会丢失数据包,那么分布式系统将简单得多。为什么我们不能在硬件层面解决这个问题,使网络变得可靠,这样软件就不需要担心这些问题?

To answer this question, it’s interesting to compare datacenter networks to the traditional fixed-line telephone network (non-cellular, non-VoIP), which is extremely reliable: delayed audio frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency and enough bandwidth to transfer the audio samples of your voice. Wouldn’t it be nice to have similar reliability and predictability in computer networks?
要回答这个问题,有趣的是将数据中心网络与传统固定电话网络(非蜂窝、非 VoIP)进行比较,后者极其可靠:延迟的音频帧和掉线电话非常罕见。一个电话呼叫需要持续的低端到端延迟和足够的带宽来传输你的语音音频样本。难道在计算机网络中拥有类似的可靠性和可预测性不好吗?

When you make a call over the telephone network, it establishes a circuit: a fixed, guaranteed amount of bandwidth is allocated for the call, along the entire route between the two callers. This circuit remains in place until the call ends [34]. For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every 250 microseconds [35].
当你通过电话网络打电话时,它会建立一个电路:为通话在整个两个通话者之间的路线上分配固定、保证的带宽。这个电路会一直保持到通话结束 [34]。例如,ISDN 网络以每秒 4,000 帧的固定速率运行。当通话建立时,每个帧(双向)内会分配 16 位空间。因此,在通话期间,每一方都保证每隔 250 微秒能够发送正好 16 位的音频数据 [35]。

This kind of network is synchronous: even as data passes through several routers, it does not suffer from queueing, because the 16 bits of space for the call have already been reserved in the next hop of the network. And because there is no queueing, the maximum end-to-end latency of the network is fixed. We call this a bounded delay.
这种网络是同步的:即使数据通过多个路由器,也不会出现排队现象,因为呼叫所需的 16 位空间已经在网络下一跳中预留好了。由于没有排队现象,网络的最大端到端延迟是固定的。我们称之为有界延迟。

Can we not simply make network delays predictable? 难道我们不能简单地使网络延迟变得可预测吗?#

Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a fixed amount of reserved bandwidth which nobody else can use while the circuit is established, whereas the packets of a TCP connection opportunistically use whatever network bandwidth is available. You can give TCP a variable-sized block of data (e.g., an email or a web page), and it will try to transfer it in the shortest time possible. While a TCP connection is idle, it doesn’t use any bandwidth (except perhaps for an occasional keepalive packet).
请注意,电话网络中的电路与 TCP 连接非常不同:电路是在建立时预留的一定量的带宽,在电路建立期间其他人无法使用,而 TCP 连接的报文会机会性地使用任何可用的网络带宽。你可以给 TCP 一个可变大小的数据块(例如,一封电子邮件或一个网页),它会尽可能快地传输它。当 TCP 连接处于空闲状态时,它不会使用任何带宽(除了偶尔的保持活动数据包)。

If datacenter networks and the internet were circuit-switched networks, it would be possible to establish a guaranteed maximum round-trip time when a circuit was set up. However, they are not: Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays in the network. These protocols do not have the concept of a circuit.
如果数据中心网络和互联网是电路交换网络,那么在电路建立时就可以保证一个最大往返时间。然而它们不是:以太网和 IP 是包交换协议,会遭受排队现象,因此在网络中存在无界的延迟。这些协议没有电路的概念。

Why do datacenter networks and the internet use packet switching? The answer is that they are optimized for bursty traffic. A circuit is good for an audio or video call, which needs to transfer a fairly constant number of bits per second for the duration of the call. On the other hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular bandwidth requirement—we just want it to complete as quickly as possible.
为什么数据中心网络和互联网使用包交换?答案是它们针对突发流量进行了优化。电路适合音频或视频通话,通话期间需要以相对恒定的比特数每秒进行传输。另一方面,请求网页、发送电子邮件或传输文件并没有特定的带宽要求 —— 我们只是希望它尽可能快地完成。

If you wanted to transfer a file over a circuit, you would have to guess a bandwidth allocation. If you guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If you guess too high, the circuit cannot be set up (because the network cannot allow a circuit to be created if its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts the rate of data transfer to the available network capacity.
如果你想在电路中传输文件,就必须猜测带宽分配。如果你猜测得太低,传输会不必要地慢,导致网络容量未被充分利用。如果你猜测得太高,电路将无法建立(因为如果网络无法保证带宽分配,它将不允许创建电路)。因此,使用电路进行突发数据传输会浪费网络容量,并使传输变得不必要地慢。相比之下,TCP 会动态地根据可用的网络容量调整数据传输速率。

There have been some attempts to build hybrid networks that support both circuit switching and packet switching. Asynchronous Transfer Mode (ATM) was a competitor to Ethernet in the 1980s, but it didn’t gain much adoption outside of telephone network core switches. InfiniBand has some similarities [36]: it implements end-to-end flow control at the link layer, which reduces the need for queueing in the network, although it can still suffer from delays due to link congestion [37]. With careful use of quality of service (QoS, prioritization and scheduling of packets) and admission control (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or provide statistically bounded delay [27,34]. New network algorithms like Low Latency, Low Loss, and Scalable Throughput (L4S) attempt to mitigate some of the queuing and congestion control problems both at the client and router level. Linux’s traffic controller (TC) also allows applications to reprioritize packets for QoS purposes.
曾有一些尝试构建支持电路交换和分组交换的混合网络。异步传输模式(ATM)在 20 世纪 80 年代是以太网的竞争对手,但它除了在电话网络核心交换机外并未获得广泛采用。InfiniBand 有一些相似之处 [36]:它在链路层实现端到端流量控制,这减少了网络中队列的需求,尽管它仍然可能因链路拥塞而延迟 [37]。通过仔细使用服务质量(QoS,数据包的优先级和调度)和准入控制(速率限制发送者),可以在分组网络上模拟电路交换,或提供统计上受限的延迟 [27,34]。新的网络算法如低延迟、低丢包和可扩展吞吐量(L4S)试图缓解客户端和路由器级别的排队和拥塞控制问题。Linux 的流量控制器(TC)也允许应用程序为 QoS 目的重新优先级排序数据包。

However, such quality of service is currently not enabled in multitenant datacenters and public clouds, or when communicating via the internet. Currently deployed technology does not allow us to make any guarantees about delays or reliability of the network: we have to assume that network congestion, queueing, and unbounded delays will happen. Consequently, there’s no “correct” value for timeouts—they need to be determined experimentally.
然而,目前多租户数据中心和公共云中,或通过互联网通信时,这种服务质量尚未启用。当前部署的技术不允许我们对延迟或网络的可靠性做出任何保证:我们必须假设网络拥塞、排队和无界延迟将会发生。因此,没有 “正确” 的超时值 —— 它们需要通过实验来确定。

Peering agreements between internet service providers and the establishment of routes through the Border Gateway Protocol (BGP), bear closer resemblance to circuit switching than IP itself. At this level, it is possible to buy dedicated bandwidth. However, internet routing operates at the level of networks, not individual connections between hosts, and at a much longer timescale.
互联网服务提供商之间的对等协议和通过边界网关协议(BGP)建立的路由,与电路交换比 IP 本身更相似。在这个层面上,可以购买专用带宽。然而,互联网路由在网络层面运作,而不是主机之间的单个连接层面,并且时间尺度更长。

Unreliable Clocks 不可靠的时钟#

Clocks and time are important. Applications depend on clocks in various ways to answer questions like the following:
时钟和时间很重要。应用程序以各种方式依赖时钟来回答以下问题:

  1. Has this request timed out yet?
    这个请求是否已经超时了?
  2. What’s the 99th percentile response time of this service?
    这个服务的 99 分位响应时间是多少?
  3. How many queries per second did this service handle on average in the last five minutes?
    这个服务在过去五分钟内平均每秒处理了多少个查询?
  4. How long did the user spend on our site?
    用户在我们的网站上花费了多长时间?
  5. When was this article published?
    这篇文章是什么时候发表的?
  6. At what date and time should the reminder email be sent?
    提醒邮件应该在什么日期和时间发送?
  7. When does this cache entry expire?
    这个缓存条目什么时候过期?
  8. What is the timestamp on this error message in the log file?
    日志文件中这个错误消息的时间戳是什么?

Examples 1–4 measure durations (e.g., the time interval between a request being sent and a response being received), whereas examples 5–8 describe points in time (events that occur on a particular date, at a particular time).
示例 1-4 测量持续时间(例如,请求发送和响应接收之间的时间间隔),而示例 5-8 描述的是时间点(在特定日期、特定时间发生的事件)。

In a distributed system, time is a tricky business, because communication is not instantaneous: it takes time for a message to travel across the network from one machine to another. The time when a message is received is always later than the time when it is sent, but due to variable delays in the network, we don’t know how much later. This fact sometimes makes it difficult to determine the order in which things happened when multiple machines are involved.
在分布式系统中,时间是件棘手的事情,因为通信不是瞬时的:消息在网络中从一台机器传送到另一台机器需要时间。消息接收的时间总是比发送时间晚,但由于网络延迟的变化,我们不知道晚多少。这个事实有时使得在涉及多台机器时难以确定事件发生的顺序。

Moreover, each machine on the network has its own clock, which is an actual hardware device: usually a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own notion of time, which may be slightly faster or slower than on other machines. It is possible to synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which allows the computer clock to be adjusted according to the time reported by a group of servers [39]. The servers in turn get their time from a more accurate time source, such as a GPS receiver.
此外,网络中的每台机器都有自己的时钟,这是一个实际的硬件设备:通常是石英晶体振荡器。这些设备并不完全精确,因此每台机器都有自己的时间概念,可能比其他机器稍快或稍慢。可以在一定程度上同步时钟:最常用的机制是网络时间协议(NTP),它允许计算机时钟根据一组服务器报告的时间进行调整 [39]。这些服务器反过来又从更准确的时间源(如 GPS 接收器)获取时间。

Monotonic Versus Time-of-Day Clocks 单调时钟与日期时间时钟#

Modern computers have at least two different kinds of clocks: a time-of-day clock and a monotonic clock. Although they both measure time, it is important to distinguish the two, since they serve different purposes.
现代计算机至少有两种不同类型的时钟:日期时间时钟和单调时钟。尽管它们都测量时间,但区分两者很重要,因为它们服务于不同的目的。

Time-of-day clocks 日期时间时钟#

A time-of-day clock does what you intuitively expect of a clock: it returns the current date and time according to some calendar (also known as wall-clock time). For example,clock_gettime(CLOCK_REALTIME) on Linux and System.currentTimeMillis() in Java return the number of seconds (or milliseconds) since the epoch: midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap seconds. Some systems use other dates as their reference point. (Although the Linux clock is called real-time, it has nothing to do with real-time operating systems, as discussed in “Response time guarantees”.)
一个时钟时间(time-of-day)时钟做着你直观上期望时钟做的事:它根据某个日历(也称为墙上时钟时间)返回当前日期和时间。例如,Linux 中的 clock_gettime(CLOCK_REALTIME) 和 Java 中的 System.currentTimeMillis() 返回自纪元以来的秒数(或毫秒数):根据格里高利历,1970 年 1 月 1 日午夜 UTC,不计闰秒。有些系统使用其他日期作为参考点。(尽管 Linux 时钟被称为实时时钟,但它与实时操作系统无关,如 “响应时间保证” 中所述。)

Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine (ideally) means the same as a timestamp on another machine. However, time-of-day clocks also have various oddities, as described in the next section. In particular, if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time. These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks unsuitable for measuring elapsed time [40].
时钟时间(time-of-day)时钟通常与 NTP 同步,这意味着一台机器上的时间戳(理想情况下)与另一台机器上的时间戳相同。然而,时钟时间(time-of-day)时钟也有各种古怪之处,如下一节所述。特别是,如果本地时钟比 NTP 服务器快太多,它可能会被强制重置,并看似跳回到之前的时间点。这些跳跃,以及由闰秒引起的类似跳跃,使得时钟时间(time-of-day)时钟不适合测量经过的时间 [40]。

Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST); these can be avoided by always using UTC as time zone, which does not have DST. Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward in steps of 10 ms on older Windows systems [41]. On recent systems, this is less of a problem.
一天中的时钟可能由于夏令时的开始和结束而出现跳跃;通过始终使用没有夏令时的协调世界时(UTC)作为时区,可以避免这种情况。一天中的时钟在历史上分辨率也相当粗糙,例如在较旧的 Windows 系统上以 10 毫秒的步长向前移动 [41]。在最近的系统上,这不是一个大问题。

Monotonic clocks 单调时钟#

A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a service’s response time: clock_gettime(CLOCK_MONOTONIC) or clock_gettime(CLOCK_BOOTTIME) on Linux [42] and System.nanoTime() in Java are monotonic clocks, for example. The name comes from the fact that they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).
单调时钟适用于测量持续时间(时间间隔),例如超时或服务的响应时间:Linux 上的 clock_gettime(CLOCK_MONOTONIC)clock_gettime(CLOCK_BOOTTIME) 以及 Java 中的 System.nanoTime() 就是单调时钟的例子。这个名字来源于它们保证总是向前移动(而一天中的时钟可能会倒退时间)。

You can check the value of the monotonic clock at one point in time, do something, and then check the clock again at a later time. The difference between the two values tells you how much time elapsed between the two checks — more like a stopwatch than a wall clock. However, the absolute value of the clock is meaningless: it might be the number of nanoseconds since the computer was booted up, or something similarly arbitrary. In particular, it makes no sense to compare monotonic clock values from two different computers, because they don’t mean the same thing.
你可以在某个时间点检查单调时钟的值,做些事情,然后再在稍后的时间点再次检查时钟。这两个值之间的差值告诉你两次检查之间经过的时间 —— 更像是秒表而不是挂钟。然而,时钟的绝对值是无意义的:它可能是计算机启动以来的纳秒数,或者类似地任意。特别是,比较两个不同计算机的单调时钟值没有意义,因为它们代表不同的含义。

On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not necessarily synchronized with other CPUs [43]. Operating systems compensate for any discrepancy and try to present a monotonic view of the clock to application threads, even as they are scheduled across different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt [44].
在具有多个 CPU 插槽的服务器上,每个 CPU 可能有一个单独的计时器,这不一定与其他 CPU 同步 [43]。操作系统会弥补任何差异,并尝试向应用程序线程呈现单调的时钟视图,即使它们在不同的 CPU 上被调度。然而,对于这种单调性的保证,最好持保留态度 [44]。

NTP may adjust the frequency at which the monotonic clock moves forward (this is known as slewing the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP server. By default, NTP allows the clock rate to be speeded up or slowed down by up to 0.05%, but NTP cannot cause the monotonic clock to jump forward or backward. The resolution of monotonic clocks is usually quite good: on most systems they can measure time intervals in microseconds or less.
NTP 可能会调整单调时钟前进的频率(这被称为拨动时钟),如果它检测到计算机的本地石英钟比 NTP 服务器移动得更快或更慢。默认情况下,NTP 允许时钟速率加快或减慢高达 0.05%,但 NTP 不能使单调时钟向前或向后跳跃。单调时钟的分辨率通常相当好:在大多数系统上,它们可以测量微秒级或更短的时间间隔。

In a distributed system, using a monotonic clock for measuring elapsed time (e.g., timeouts) is usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is not sensitive to slight inaccuracies of measurement.
在分布式系统中,使用单调时钟来测量经过的时间(例如,超时)通常是没问题的,因为它不假设不同节点时钟之间的任何同步,并且对测量的微小误差不敏感。

Clock Synchronization and Accuracy 时钟同步和精度#

Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful. Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope—hardware clocks and NTP can be fickle beasts. To give just a few examples:
单调时钟无需同步,但日期时钟需要根据 NTP 服务器或其他外部时间源进行设置才能使用。不幸的是,我们让时钟显示正确时间的方法并不可靠或精确 —— 硬件时钟和 NTP 都可能不稳定。仅举几例:

  • The quartz clock in a computer is not very accurate: it drifts (runs faster or slower than it should). Clock drift varies depending on the temperature of the machine. Google assumes a clock drift of up to 200 ppm (parts per million) for its servers [45], which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30 seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best possible accuracy you can achieve, even if everything is working correctly.
    计算机中的石英钟不够准确:它会漂移(运行速度比应有的快或慢)。时钟漂移程度取决于机器的温度。谷歌假设其服务器时钟漂移高达 200 ppm(百万分之几)[45],相当于每 30 秒与服务器重新同步的时钟漂移 6 毫秒,或每天重新同步一次的时钟漂移 17 秒。这种漂移限制了即使一切正常工作也能达到的最佳精度。
  • If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset [39]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.
    如果计算机的时钟与 NTP 服务器差异过大,它可能会拒绝同步,或者本地时钟会被强制重置 [39]。任何在重置前后的时间观察应用程序可能会看到时间倒退或突然跳跃。
  • If a node is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for some time, during which the drift may add up to large discrepancies between different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice.
    如果一个节点意外地被防火墙隔离,无法连接到 NTP 服务器,这种配置错误可能长时间未被察觉,期间时钟漂移可能导致不同节点之间的时间差异很大。有传闻证据表明,这种情况在实践中确实会发生。
  • NTP synchronization can only be as good as the network delay, so there is a limit to its accuracy when you’re on a congested network with variable packet delays. One experiment showed that a minimum error of 35 ms is achievable when synchronizing over the internet [46], though occasional spikes in network delay lead to errors of around a second. Depending on the configuration, large network delays can cause the NTP client to give up entirely.
    NTP 同步的精度受限于网络延迟,因此在你处于网络拥塞且包延迟变化的情况下,其准确性有限。一个实验表明,通过互联网同步时可以实现 35 毫秒的最小误差 [46],尽管偶尔的网络延迟峰值会导致大约一秒的误差。根据配置,较大的网络延迟可能导致 NTP 客户端完全放弃同步。
  • Some NTP servers are wrong or misconfigured, reporting time that is off by hours [47,48]. NTP clients mitigate such errors by querying several servers and ignoring outliers. Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you were told by a stranger on the internet.
    一些 NTP 服务器存在错误或配置不当,报告的时间可能偏差数小时 [47, 48]。NTP 客户端通过查询多个服务器并忽略异常值来缓解此类错误。然而,依赖一个陌生人通过互联网告知的时间来确保系统正确性,这多少有些令人担忧。
  • Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing assumptions in systems that are not designed with leap seconds in mind [49].The fact that leap seconds have crashed many large systems [40,50] shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second adjustment gradually over the course of a day (this is known as smearing) [51,52], although actual NTP server behavior varies in practice [53]. Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away.
    闰秒会导致一分钟有 59 秒或 61 秒长,这会打乱那些没有考虑闰秒的系统中的时间假设 [49]。闰秒导致许多大型系统崩溃的事实 [40,50] 表明,关于时钟的错误假设很容易悄悄进入系统。处理闰秒的最佳方法可能是让 NTP 服务器 “撒谎”,通过在一天内逐渐进行闰秒调整(这被称为平滑处理)[51,52],尽管实际的 NTP 服务器行为在实践中有所不同 [53]。从 2035 年起将不再使用闰秒,因此这个问题幸运地将消失。
  • In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping [54]. When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds while another VM is running. From an application’s point of view, this pause manifests itself as the clock suddenly jumping forward [29]. If a VM pauses for several seconds, the clock may then be several seconds behind the actual time, but NTP may continue to report that the clock is almost perfectly in sync [55].
    在虚拟机中,硬件时钟被虚拟化,这给需要精确计时功能的应用程序带来了额外的挑战 [54]。当 CPU 核心在虚拟机之间共享时,一个虚拟机在另一个虚拟机运行时会暂停几十毫秒。从应用程序的角度来看,这种暂停表现为时钟突然跳跃 [29]。如果一个虚拟机暂停了几秒钟,时钟可能会比实际时间慢几秒钟,但 NTP 可能仍然报告时钟几乎完全同步 [55]。
  • If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you probably cannot trust the device’s hardware clock at all. Some users deliberately set their hardware clock to an incorrect date and time, for example to cheat in games [56]. As a result, the clock might be set to a time wildly in the past or the future.
    如果你在无法完全控制的设备上运行软件(例如移动设备或嵌入式设备),你可能完全无法信任设备的硬件时钟。一些用户故意将硬件时钟设置为不正确的日期和时间,例如为了在游戏中作弊 [56]。因此,时钟可能被设置为过去或未来的某个离谱的时间。

It is possible to achieve very good clock accuracy if you care about it sufficiently to invest significant resources. For example, the MiFID II European regulation for financial institutions requires all high-frequency trading funds to synchronize their clocks to within 100 microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help detect market manipulation [57].
如果你足够重视并投入足够资源,可以实现非常高的时钟精度。例如,金融机构的 MiFID II 欧洲法规要求所有高频交易基金将其时钟与 UTC 同步到 100 微秒以内,以帮助调试市场异常(如 “闪崩”)并帮助检测市场操纵 [57]。

Such accuracy can be achieved with some special hardware (GPS receivers and/or atomic clocks), the Precision Time Protocol (PTP) and careful deployment and monitoring [58,59]. Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this happens frequently, e.g. close to military facilities [60]. Some cloud providers have begun offering high-accuracy clock synchronization for their virtual machines [61]. However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
这种精度可以通过一些特殊硬件(GPS 接收器和 / 或原子钟)、精密时间协议(PTP)以及仔细的部署和监控来实现 [58,59]。仅依赖 GPS 存在风险,因为 GPS 信号很容易被干扰。在某些地区这种情况经常发生,例如靠近军事设施的地方 [60]。一些云服务提供商已经开始为其虚拟机提供高精度时钟同步服务 [61]。然而,时钟同步仍然需要非常小心。如果你的 NTP 守护进程配置错误,或者防火墙阻止了 NTP 流量,由于漂移导致的时钟误差会很快变得很大。

Relying on Synchronized Clocks 依赖同步时钟#

The problem with clocks is that while they seem simple and easy to use, they have a surprising number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward in time, and the time according to one node’s clock may be quite different from another node’s clock.
时钟的问题在于,尽管它们看起来简单易用,但存在许多意想不到的陷阱:一天可能不正好有 86,400 秒,时钟可能向后移动,一个节点的时钟时间可能与另一个节点的时钟时间差异很大。

Earlier in this chapter we discussed networks dropping and arbitrarily delaying packets. Even though networks are well behaved most of the time, software must be designed on the assumption that the network will occasionally be faulty, and the software must handle such faults gracefully. The same is true with clocks: although they work quite well most of the time, robust software needs to be prepared to deal with incorrect clocks.
在本章前面,我们讨论了网络丢包和任意延迟数据包的问题。尽管网络大多数时候表现良好,但软件必须基于网络偶尔会出故障的假设进行设计,并且软件必须优雅地处理这些故障。时钟也是如此:尽管它们大多数时候工作得很好,但健壮的软件需要准备好处理不正确的时钟。

Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most things will seem to work fine, even though its clock gradually drifts further and further away from reality. If some piece of software is relying on an accurately synchronized clock, the result is more likely to be silent and subtle data loss than a dramatic crash [62,63].
部分问题在于不准确的时钟很容易被忽视。如果一台机器的 CPU 有缺陷或网络配置错误,它很可能完全无法工作,因此会很快被发现并修复。另一方面,如果它的石英钟有缺陷或 NTP 客户端配置错误,大多数事情看起来都会正常进行,尽管它的时钟逐渐与现实越来越远。如果某个软件依赖于精确同步的时钟,结果更有可能是无声而微妙的数据丢失,而不是剧烈的崩溃 [62, 63]。

Thus, if you use software that requires synchronized clocks, it is essential that you also carefully monitor the clock offsets between all the machines. Any node whose clock drifts too far from the others should be declared dead and removed from the cluster. Such monitoring ensures that you notice the broken clocks before they can cause too much damage.
因此,如果你使用需要同步时钟的软件,那么仔细监控所有机器之间的时钟偏移是至关重要的。任何时钟与其他机器漂移过远的节点都应被宣布为死亡并从集群中移除。这种监控确保你在时钟造成过多损害之前就能发现它们。

Timestamps for ordering events 事件排序的时间戳#

Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks: ordering of events across multiple nodes [64]. For example, if two clients write to a distributed database, who got there first? Which write is the more recent one?
让我们考虑一个特定的情况,在这个情况下依赖时钟既有诱惑性又危险:跨多个节点的事件排序 [64]。例如,如果两个客户端向分布式数据库写入,谁先到?哪个写入是更近的?

Figure 9-3 illustrates a dangerous use of time-of-day clocks in a database with multi-leader replication (the example is similar to Figure 6-8). Client A writes x  = 1 on node 1; the write is replicated to node 3; client B increments x on node 3 (we now have x = 2); and finally, both writes are replicated to node 2.
图 9-3 展示了一个数据库在多主复制(该示例与图 6-8 类似)中危险使用时间戳的例子。客户端 A 在节点 1 上写入 x = 1;该写入被复制到节点 3;客户端 B 在节点 3 上递增 x(现在 x = 2);最后,这两个写入都被复制到节点 2。

In Figure 9-3, when a write is replicated to other nodes, it is tagged with a timestamp according to the time-of-day clock on the node where the write originated. The clock synchronization is very good in this example: the skew between node 1 and node 3 is less than 3 ms, which is probably better than you can expect in practice.
在图 9-3 中,当写入被复制到其他节点时,它会被标记上源节点的时间戳。在这个示例中,时钟同步非常好:节点 1 和节点 3 之间的偏差小于 3 毫秒,这可能是实际应用中你能期望的最好情况。

Since the increment builds upon the earlier write of x  = 1, we might expect that the write of x  = 2 should have the greater timestamp of the two. Unfortunately, that is not what happens in Figure 9-3: the write x  = 1 has a timestamp of 42.004 seconds, but the write x = 2 has a timestamp of 42.003 seconds.
由于自增操作是基于先前写入的 x = 1 构建的,我们可能会预期 x = 2 的写入应该具有更大的时间戳。然而,不幸的是,在图 9-3 中并没有发生这种情况:x = 1 的写入时间戳为 42.004 秒,而 x = 2 的写入时间戳为 42.003 秒。

As discussed in “Last write wins (discarding concurrent writes)”, one way of resolving conflicts between concurrently written values on different nodes is last write wins (LWW), which means keeping the write with the greatest timestamp for a given key and discarding all writes with older timestamps. In the example of Figure 9-3, when node 2 receives these two events, it will incorrectly conclude that x  = 1 is the more recent value and drop the write x = 2, so the increment is lost.
正如在 “最后写入者胜出(丢弃并发写入)” 中讨论的,解决不同节点上并发写入值之间冲突的一种方法是最后写入者胜出(LWW),这意味着保留给定键具有最大时间戳的写入,并丢弃所有具有较旧时间戳的写入。在图 9-3 的示例中,当节点 2 接收到这两个事件时,它会错误地得出 x = 1 是更近的值的结论,并丢弃 x = 2 的写入,因此自增操作会丢失。

This problem can be prevented by ensuring that when a value is overwritten, the new value always has a higher timestamp than the overwritten value, even if that timestamp is ahead of the writer’s local clock. However, that incurs the cost of an additional read to find the greatest existing timestamp. Some systems, including Cassandra and ScyllaDB, want to write to all replicas in a single round trip, and therefore they simply use the client clock’s timestamp along with a last write wins policy [62]. This approach has some serious problems:
这个问题可以通过确保在值被覆盖时,新值总是比被覆盖的值具有更高的时间戳来防止,即使这个时间戳在写入者的本地时钟之前。然而,这需要额外读取以找到现有的最大时间戳,从而产生成本。一些系统,包括 Cassandra 和 ScyllaDB,希望在单次往返中写入所有副本,因此它们简单地使用客户端时钟的时间戳以及最后写入者胜出的策略 [62]。这种方法存在一些严重问题:

  • Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite values previously written by a node with a fast clock until the clock skew between the nodes has elapsed [63,]. This scenario can cause arbitrary amounts of data to be silently dropped without any error being reported to the application.
    数据库写入可能会神秘消失:一个时钟滞后的节点无法覆盖一个时钟快速的节点先前写入的值,直到节点之间的时钟偏差过去 [63, 65]。这种情况可能导致任意数量的数据在没有任何错误报告给应用程序的情况下被无声地丢弃。
  • LWW cannot distinguish between writes that occurred sequentially in quick succession (in Figure 9-3, client B’s increment definitely occurs after client A’s write) and writes that were truly concurrent (neither writer was aware of the other). Additional causality tracking mechanisms, such as version vectors, are needed in order to prevent violations of causality (see “Detecting Concurrent Writes”).
    LWW 无法区分快速连续发生的写操作(在图 9-3 中,客户端 B 的增量确实发生在客户端 A 的写操作之后)和真正并发发生的写操作(两个写操作者都没有意识到对方)。为了防止因果关系被破坏,需要额外的因果关系跟踪机制,例如版本向量(参见 “检测并发写操作”)。
  • It is possible for two nodes to independently generate writes with the same timestamp, especially when the clock only has millisecond resolution. An additional tiebreaker value (which can simply be a large random number) is required to resolve such conflicts, but this approach can also lead to violations of causality [62].
    当时钟只有毫秒级分辨率时,两个节点可能独立地生成具有相同时间戳的写操作。需要额外的破平局值(可以简单地是一个大的随机数)来解决这种冲突,但这种方法也可能导致因果关系被破坏 [62]。

Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and discarding others, it’s important to be aware that the definition of “recent” depends on a local time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet arrived before it was sent, which is impossible.
因此,尽管通过保留最 “最新” 的值并丢弃其他值来解决冲突很有诱惑力,但重要的是要意识到 “最新” 的定义依赖于本地时钟,而该时钟很可能是不准确的。即使使用紧密的 NTP 同步时钟,你可能会在发送者时钟的 100 毫秒时间戳发送数据包,而它在接收者时钟的时间戳为 99 毫秒到达 —— 因此看起来数据包似乎在发送之前就到达了,这是不可能的。

Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur? Probably not, because NTP’s synchronization accuracy is itself limited by the network round-trip time, in addition to other sources of error such as quartz drift. To guarantee a correct ordering, you would need the clock error to be significantly lower than the network delay, which is not possible.
NTP 同步是否足够精确,以至于不会发生这种不正确的顺序?很可能不够,因为 NTP 的同步精度本身受网络往返时间以及其他误差源的限制,如石英漂移。要保证正确的顺序,你需要时钟误差显著低于网络延迟,这是不可能的。

So-called logical clocks [66], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer alternative for ordering events (see “Detecting Concurrent Writes”). Logical clocks do not measure the time of day or the number of seconds elapsed, only the relative ordering of events (whether one event happened before or after another). In contrast, time-of-day and monotonic clocks, which measure actual elapsed time, are also known as physical clocks. We’ll look at logical clocks in more detail in “ID Generators and Logical Clocks”.
所谓的逻辑时钟 [66],基于递增计数器而非振荡的石英晶体,是排序事件的更安全替代方案(参见 “检测并发写入”)。逻辑时钟不测量一天中的时间或经过的秒数,只测量事件之间的相对顺序(一个事件是否发生在另一个事件之前或之后)。相比之下,测量实际经过时间的日期时钟和单调时钟也被称为物理时钟。我们将在 “ID 生成器和逻辑时钟” 中更详细地探讨逻辑时钟。

Clock readings with a confidence interval 带置信区间的时钟读数#

You may be able to read a machine’s time-of-day clock with microsecond or even nanosecond resolution. But even if you can get such a fine-grained measurement, that doesn’t mean the value is actually accurate to such precision. In fact, it most likely is not—as mentioned previously, the drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with an NTP server on the local network every minute. With an NTP server on the public internet, the best possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over 100 ms when there is network congestion.
你或许能够以微秒甚至纳秒的分辨率读取一台机器的日时钟。但即便你能获得这种精细的测量,也不代表该值实际上达到了这种精度。事实上,它很可能没有 —— 正如之前提到的,一个不精确的石英钟的漂移可能很容易达到几毫秒,即使你每分钟都与本地网络上的 NTP 服务器同步。对于公共互联网上的 NTP 服务器,可能的最大精度可能只有几十毫秒,当网络拥塞时,误差可能轻易飙升至 100 毫秒以上。

Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a range of times, within a confidence interval: for example, a system may be 95% confident that the time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely than that [67]. If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.
因此,将时钟读数视为一个时间点并不合理 —— 它更像是一个时间范围,在置信区间内:例如,系统可能有 95% 的把握认为当前时间在分钟后的 10.3 秒到 10.5 秒之间,但它无法更精确地知道这一点 [67]。如果我们只知道时间 ±100 毫秒,那么时间戳中的微秒位数基本上没有意义。

The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or atomic clock directly attached to your computer, the expected error range is determined by the device and, in the case of GPS, by the quality of the signal from the satellites. If you’re getting the time from a server, the uncertainty is based on the expected quartz drift since your last sync with the server, plus the NTP server’s uncertainty, plus the network round-trip time to the server (to a first approximation, and assuming you trust the server).
不确定性界限可以根据您的时间源进行计算。如果您有直接连接到计算机的 GPS 接收器或原子钟,预期的误差范围由设备决定,在 GPS 的情况下,还由卫星信号的质量决定。如果您从服务器获取时间,不确定性基于自上次与服务器同步以来的预期石英漂移,加上 NTP 服务器的不确定性,再加上到服务器的网络往返时间(以第一近似值计算,并假设您信任该服务器)。

Unfortunately, most systems don’t expose this uncertainty: for example, when you call clock_gettime(), the return value doesn’t tell you the expected error of the timestamp, so you don’t know if its confidence interval is five milliseconds or five years.
不幸的是,大多数系统都不提供这种不确定性:例如,当您调用 clock_gettime() 时,返回值不会告诉您时间戳的预期误差,因此您不知道其置信区间是五毫秒还是五年。

There are exceptions: the TrueTime API in Google’s Spanner [45] and Amazon’s ClockBound explicitly report the confidence interval on the local clock. When you ask it for the current time, you get back two values: [*earliest*, *latest*], which are the earliest possible and the latest possible timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is somewhere within that interval. The width of the interval depends, among other things, on how long it has been since the local quartz clock was last synchronized with a more accurate clock source.
有例外情况:谷歌的 Spanner 中的 TrueTime API [45] 和亚马逊的 ClockBound 明确报告本地时钟的置信区间。当你请求当前时间时,你会得到两个值: [*earliest*, *latest*] ,分别是可能最早的和最晚的时间戳。根据其不确定性计算,时钟知道实际当前时间位于该区间内。该区间的宽度取决于多种因素,包括本地石英时钟上次与更精确的时钟源同步以来所经过的时间。

Synchronized clocks for global snapshots 用于全局快照的同步时钟#

In “Snapshot Isolation and Repeatable Read” we discussed multi-version concurrency control (MVCC), which is a very useful feature in databases that need to support both small, fast read-write transactions and large, long-running read-only transactions (e.g., for backups or analytics). It allows read-only transactions to see a snapshot of the database, a consistent state at a particular point in time, without locking and interfering with read-write transactions.
在 “快照隔离和可重复读” 中,我们讨论了多版本并发控制(MVCC),这是需要同时支持小型快速读写事务和大型长时间运行的只读事务(例如备份或分析)的数据库中的一个非常有用的特性。它允许只读事务查看数据库的快照,即在特定时间点的数据库一致性状态,而无需锁定且不会干扰读写事务。

Generally, MVCC requires a monotonically increasing transaction ID. If a write happened later than the snapshot (i.e., the write has a greater transaction ID than the snapshot), that write is invisible to the snapshot transaction. On a single-node database, a simple counter is sufficient for generating transaction IDs.
通常情况下,MVCC 需要一个单调递增的事务 ID。如果写入操作发生在快照之后(即写入操作的事务 ID 大于快照的事务 ID),那么该写入对快照事务是不可见的。在单节点数据库中,一个简单的计数器就足以生成事务 ID。

However, when a database is distributed across many machines, potentially in multiple datacenters, a global, monotonically increasing transaction ID (across all shards) is difficult to generate, because it requires coordination. The transaction ID must reflect causality: if transaction B reads or overwrites a value that was previously written by transaction A, then B must have a higher transaction ID than A—otherwise, the snapshot would not be consistent. With lots of small, rapid transactions, creating transaction IDs in a distributed system becomes an untenable bottleneck. (We will discuss such ID generators in “ID Generators and Logical Clocks”.)
然而,当数据库分布到多台机器上,甚至可能分布在多个数据中心时,生成一个全局单调递增的事务 ID(跨所有分片)变得非常困难,因为它需要协调。事务 ID 必须反映因果关系:如果事务 B 读取或覆盖了事务 A 之前写入的值,那么 B 的事务 ID 必须高于 A—— 否则,快照将不一致。在大量小而快的交易中,在分布式系统中创建事务 ID 会成为一个难以承受的瓶颈。(我们将在 “ID 生成器和逻辑时钟” 中讨论这类 ID 生成器。)

Can we use the timestamps from synchronized time-of-day clocks as transaction IDs? If we could get the synchronization good enough, they would have the right properties: later transactions have a higher timestamp. The problem, of course, is the uncertainty about clock accuracy.
我们能用同步的时钟时间戳作为事务 ID 吗?如果我们能足够好地同步它们,它们就会具有正确的属性:较晚的事务具有更高的时间戳。当然,问题是时钟准确性的不确定性。

Spanner implements snapshot isolation across datacenters in this way [68,69]. It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the following observation: if you have two confidence intervals, each consisting of an earliest and latest possible timestamp (A = [A earliest, A latest] and B = [B earliest, B latest]), and those two intervals do not overlap (i.e.,A earliest < A latest < B earliest < B latest), then B definitely happened after A—there can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.
Spanner 通过这种方式在数据中心之间实现快照隔离 [68, 69]。它使用 TrueTime API 报告的时钟置信区间,并基于以下观察:如果你有两个置信区间,每个区间都由最早和最晚可能的时间戳组成(A = [A earliest, A latest ] 和 B = [B earliest, B latest ]),并且这两个区间不重叠(即 A earliest < A latest < B earliest < B latest ),那么 B 肯定发生在 A 之后 —— 毫无疑问。只有当区间重叠时,我们才不确定 A 和 B 发生的顺序。

In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction. By doing so, it ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [45].
为了确保事务时间戳反映因果关系,Spanner 特意等待置信区间的长度后才提交读写事务。通过这种方式,它确保任何可能读取数据的事务都在足够晚的时间,因此它们的置信区间不会重叠。为了将等待时间尽可能缩短,Spanner 需要将时钟不确定性尽可能减小;为此,Google 在每个数据中心部署了 GPS 接收器或原子钟,使时钟同步精度达到约 7 毫秒 [45]。

The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to have a confidence interval, and the accurate clock sources only help keep that interval small. Other systems are beginning to adopt similar approaches: for example, YugabyteDB can leverage ClockBound when running on AWS [70], and several other systems now also rely on clock synchronization to various degrees [71,72].
原子钟和 GPS 接收器在 Spanner 中并非绝对必要:重要的是要有置信区间,而精确的时钟源只是帮助保持该区间较小。其他系统也开始采用类似方法:例如,YugabyteDB 在运行于 AWS 时可以利用 ClockBound [70],还有其他几个系统现在也不同程度地依赖时钟同步 [71,72]。

Process Pauses 进程暂停#

Let’s consider another example of dangerous clock use in a distributed system. Say you have a database with a single leader per shard. Only the leader is allowed to accept writes. How does a node know that it is still leader (that it hasn’t been declared dead by the others), and that it may safely accept writes?
让我们再考虑一个分布式系统中危险时钟使用的例子。假设你有一个数据库,每个分片只有一个领导者。只有领导者被允许接受写入。节点如何知道它仍然是领导者(即没有被其他人宣布为死亡),并且可以安全地接受写入?

One option is for the leader to obtain a lease from the other nodes, which is similar to a lock with a timeout [73]. Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that it is the leader for some amount of time, until the lease expires. In order to remain leader, the node must periodically renew the lease before it expires. If the node fails, it stops renewing the lease, so another node can take over when it expires.
一种选择是领导者从其他节点获取租约,这类似于一个带超时的锁 [73]。在任何时候,只有一个节点可以持有租约 —— 因此,当一个节点获取租约时,它知道在一段时间内它是领导者,直到租约过期。为了保持领导者地位,节点必须在租约过期前定期续订租约。如果节点失败,它将停止续订租约,因此当租约过期时,另一个节点可以接管。

You can imagine the request-handling loop looking something like this:
你可以想象请求处理循环看起来像这样:

What’s wrong with this code? Firstly, it’s relying on synchronized clocks: the expiry time on the lease is set by a different machine (where the expiry may be calculated as the current time plus 30 seconds, for example), and it’s being compared to the local system clock. If the clocks are out of sync by more than a few seconds, this code will start doing strange things.
这段代码有什么问题?首先,它依赖于同步时钟:租约的过期时间由另一台机器设置(过期时间可能是当前时间加 30 秒,例如),然后与本地系统时钟进行比较。如果时钟不同步超过几秒钟,这段代码就会开始出现奇怪的行为。

Secondly, even if we change the protocol to only use the local monotonic clock, there is another problem: the code assumes that very little time passes between the point that it checks the time (System.currentTimeMillis()) and the time when the request is processed (process(request)). Normally this code runs very quickly, so the 10 second buffer is more than enough to ensure that the lease doesn’t expire in the middle of processing a request.
其次,即使我们将协议改为仅使用本地单调时钟,还有一个问题:代码假设在检查时间( System.currentTimeMillis() )和处理请求的时间( process(request) )之间经过的时间非常少。通常这段代码运行得非常快,所以 10 秒的缓冲时间足够确保租约在请求处理过程中不会过期。

However, what if there is an unexpected pause in the execution of the program? For example, imagine the thread stops for 15 seconds around the line lease.isValid() before finally continuing. In that case, it’s likely that the lease will have expired by the time the request is processed, and another node has already taken over as leader. However, there is nothing to tell this thread that it was paused for so long, so this code won’t notice that the lease has expired until the next iteration of the loop—by which time it may have already done something unsafe by processing the request.
然而,如果程序执行出现意外暂停怎么办?例如,假设线程在 lease.isValid() 行附近停止了 15 秒,然后才继续执行。在这种情况下,当请求被处理时,租约很可能已经过期,另一个节点已经接管了领导权。但是,没有任何机制告诉这个线程它暂停了这么长时间,因此这段代码直到循环的下次迭代才会注意到租约已过期 —— 而此时它可能已经通过处理请求做了不安全的事情。

Is it reasonable to assume that a thread might be paused for so long? Unfortunately yes. There are various reasons why this could happen:
假设线程会暂停这么长时间是合理的吗?不幸的是,是。这可能有多种原因:

  • Contention among threads accessing a shared resource, such as a lock or queue, can cause threads to spend a lot of their time waiting. Moving to a machine with more CPU cores can make such problems worse, and contention problems can be difficult to diagnose [74].
    多个线程访问共享资源(如锁或队列)时的竞争可能导致线程花费大量时间等待。迁移到拥有更多 CPU 核心的机器可能会使这类问题更严重,而竞争问题可能难以诊断 [74]。
  • Many programming language runtimes (such as the Java Virtual Machine) have a garbage collector (GC) that occasionally needs to stop all running threads. In the past, such “stop-the-world” GC pauses would sometimes last for several minutes [75]! With modern GC algorithms this is less of a problem, but GC pauses can still be noticable (see “Limiting the impact of garbage collection”).
    许多编程语言运行时(如 Java 虚拟机)都有一个垃圾收集器(GC),偶尔需要停止所有正在运行的线程。在过去,这种 “停止世界” 的 GC 暂停有时会持续几分钟 [75]!随着现代 GC 算法的发展,这个问题已经不那么严重了,但 GC 暂停仍然可能很明显(参见 “限制垃圾收集的影响”)。
  • In virtualized environments, a virtual machine can be suspended (pausing the execution of all processes and saving the contents of memory to disk) and resumed (restoring the contents of memory and continuing execution). This pause can occur at any time in a process’s execution and can last for an arbitrary length of time. This feature is sometimes used for live migration of virtual machines from one host to another without a reboot, in which case the length of the pause depends on the rate at which processes are writing to memory [76].
    在虚拟化环境中,虚拟机可以被挂起(暂停所有进程并保存内存内容到磁盘)和恢复(恢复内存内容并继续执行)。这种暂停可以在进程执行中的任何时间发生,并且可以持续任意长度的时间。这个特性有时用于在不重启的情况下将虚拟机从一台主机迁移到另一台主机,在这种情况下,暂停的长度取决于进程写入内存的速度 [76]。
  • On end-user devices such as laptops and phones, execution may also be suspended and resumed arbitrarily, e.g., when the user closes the lid of their laptop.
    在笔记本电脑和手机等终端用户设备上,执行也可能被任意挂起和恢复,例如当用户合上笔记本电脑的盖子时。
  • When the operating system context-switches to another thread, or when the hypervisor switches to a different virtual machine (when running in a virtual machine), the currently running thread can be paused at any arbitrary point in the code. In the case of a virtual machine, the CPU time spent in other virtual machines is known as steal time. If the machine is under heavy load—i.e., if there is a long queue of threads waiting to run—it may take some time before the paused thread gets to run again.
    当操作系统上下文切换到另一个线程,或当虚拟机监视器切换到不同的虚拟机(在虚拟机中运行时),当前正在运行的线程可以在代码的任意点暂停。在虚拟机的情况下,在其他虚拟机中花费的 CPU 时间被称为窃取时间。如果机器负载很重 —— 即如果有大量线程等待运行 —— 暂停的线程可能需要一段时间才能再次运行。
  • If the application performs synchronous disk access, a thread may be paused waiting for a slow disk I/O operation to complete [77]. In many languages, disk access can happen surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java classloader lazily loads class files when they are first used, which could happen at any time in the program execution. I/O pauses and GC pauses may even conspire to combine their delays [78]. If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the I/O latency is further subject to the variability of network delays [31].
    如果应用程序执行同步磁盘访问,一个线程可能会暂停等待慢速磁盘 I/O 操作完成 [77]。在许多语言中,即使代码没有明确提到文件访问,磁盘访问也可能出乎意料地发生 —— 例如,Java 类加载器在第一次使用时懒加载类文件,这可能在程序执行的任何时间发生。I/O 暂停和 GC 暂停甚至可能合谋将它们的延迟结合起来 [78]。如果磁盘实际上是网络文件系统或网络块设备(例如亚马逊的 EBS),I/O 延迟还会受到网络延迟变化的影响 [31]。
  • If the operating system is configured to allow swapping to disk (paging), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory. The thread is paused while this slow I/O operation takes place. If memory pressure is high, this may in turn require a different page to be swapped out to disk. In extreme circumstances, the operating system may spend most of its time swapping pages in and out of memory and getting little actual work done (this is known as thrashing). To avoid this problem, paging is often disabled on server machines (if you would rather kill a process to free up memory than risk thrashing).
    如果操作系统配置为允许交换到磁盘(分页),一个简单的内存访问可能会导致页面错误,需要将磁盘上的页面加载到内存中。在此缓慢的 I / O 操作进行期间,线程会被暂停。如果内存压力很大,这反过来又可能需要将另一个页面交换到磁盘上。在极端情况下,操作系统可能会将大部分时间都花在页面进出内存上,而很少完成实际工作(这被称为颠簸)。为了避免这个问题,服务器机器上通常会禁用分页(如果你宁愿杀死进程来释放内存,而不是冒险颠簸)。
  • A Unix process can be paused by sending it the SIGSTOP signal, for example by pressing Ctrl-Z in a shell. This signal immediately stops the process from getting any more CPU cycles until it is resumed with SIGCONT, at which point it continues running where it left off. Even if your environment does not normally use SIGSTOP, it might be sent accidentally by an operations engineer.
    一个 Unix 进程可以通过发送 SIGSTOP 信号来暂停,例如在 shell 中按 Ctrl-Z。这个信号会立即停止进程获取任何更多的 CPU 周期,直到它用 SIGCONT 恢复时,才会继续运行它离开的地方。即使你的环境通常不使用 SIGSTOP ,它也可能被运维工程师意外发送。

All of these occurrences can preempt the running thread at any point and resume it at some later time, without the thread even noticing. The problem is similar to making multi-threaded code on a single machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and parallelism may occur.
所有这些情况都可能在任何时刻中断正在运行的线程,并在稍后的某个时间重新启动它,而线程本身甚至没有察觉到。这个问题与在单机上编写多线程代码使其线程安全类似:你不能对时间做出任何假设,因为可能会发生随机的上下文切换和并行操作。

When writing multi-threaded code on a single machine, we have fairly good tools for making it thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and so on. Unfortunately, these tools don’t directly translate to distributed systems, because a distributed system has no shared memory—only messages sent over an unreliable network.
在单机上编写多线程代码时,我们拥有相当好的工具来使其线程安全:互斥锁、信号量、原子计数器、无锁数据结构、阻塞队列等等。不幸的是,这些工具不能直接应用于分布式系统,因为分布式系统没有共享内存 —— 只有通过不可靠网络发送的消息。

A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function. During the pause, the rest of the world keeps moving and may even declare the paused node dead because it’s not responding. Eventually, the paused node may continue running, without even noticing that it was asleep until it checks its clock sometime later.
分布式系统中的节点必须假设其执行可以在任何时刻暂停很长时间,即使在函数执行中间也可能发生。在暂停期间,世界其他部分仍在继续运行,甚至可能因为暂停的节点没有响应而宣布其死亡。最终,暂停的节点可能会继续运行,而它甚至没有察觉到自己曾经休眠,直到稍后检查时钟时才发现。

Response time guarantees 响应时间保证#

In many programming languages and operating systems, threads and processes may pause for an unbounded amount of time, as discussed. Those reasons for pausing can be eliminated if you try hard enough.
在许多编程语言和操作系统中,线程和进程可能会无限期地暂停,正如所讨论的。如果你足够努力,这些暂停的原因是可以消除的。

Some software runs in environments where a failure to respond within a specified time can cause serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects must respond quickly and predictably to their sensor inputs. In these systems, there is a specified deadline by which the software must respond; if it doesn’t meet the deadline, that may cause a failure of the entire system. These are so-called hard real-time systems.
某些软件运行在环境中,如果在指定时间内未能响应可能会导致严重损害:控制飞机、火箭、机器人、汽车和其他物理对象的计算机必须对其传感器输入做出快速且可预测的响应。在这些系统中,软件必须在指定的截止日期前做出响应;如果未能满足截止日期,可能会导致整个系统的故障。这些被称为硬实时系统。

Note 注意#

In embedded systems, real-time means that a system is carefully designed and tested to meet specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the term real-time on the web, where it describes servers pushing data to clients and stream processing without hard response time constraints (see Chapter 12).
在嵌入式系统中,实时意味着系统经过精心设计和测试,以确保在所有情况下都能满足指定的时序保证。这种含义与网络上对实时一词更模糊的使用形成对比,后者描述的是服务器向客户端推送数据以及无严格响应时间限制的流处理(见第 12 章)。

For example, if your car’s onboard sensors detect that you are currently experiencing a crash, you wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag release system.
例如,如果你的汽车车载传感器检测到你正在经历碰撞,你就不希望由于安全气囊释放系统中的不恰当 GC 暂停而导致安全气囊的释放被延迟。

Providing real-time guarantees in a system requires support from all levels of the software stack: a real-time operating system (RTOS) that allows processes to be scheduled with a guaranteed allocation of CPU time in specified intervals is needed; library functions must document their worst-case execution times; dynamic memory allocation may be restricted or disallowed entirely (real-time garbage collectors exist, but the application must still ensure that it doesn’t give the GC too much work to do); and an enormous amount of testing and measurement must be done to ensure that guarantees are being met.
在系统中提供实时保证需要软件栈所有层级的支持:需要一个实时操作系统(RTOS),它允许进程在指定的时间间隔内以保证的 CPU 时间分配进行调度;库函数必须记录其最坏情况执行时间;动态内存分配可能被限制或完全禁止(存在实时垃圾收集器,但应用程序仍需确保不会让 GC 做太多工作);并且必须进行大量的测试和测量以确保保证得到满足。

All of this requires a large amount of additional work and severely restricts the range of programming languages, libraries, and tools that can be used (since most languages and tools do not provide real-time guarantees). For these reasons, developing real-time systems is very expensive, and they are most commonly used in safety-critical embedded devices. Moreover, “real-time” is not the same as “high-performance”—in fact, real-time systems may have lower throughput, since they have to prioritize timely responses above all else (see also “Latency and Resource Utilization”).
所有这些都需要大量的额外工作,并且严重限制了可以使用的编程语言、库和工具的范围(因为大多数语言和工具不提供实时保证)。由于这些原因,开发实时系统成本非常高,它们最常用于安全关键型嵌入式设备。此外,“实时” 并不等同于 “高性能”—— 事实上,实时系统可能具有较低的吞吐量,因为它们必须优先考虑及时响应(另见 “延迟和资源利用率”)。

For most server-side data processing systems, real-time guarantees are simply not economical or appropriate. Consequently, these systems must suffer the pauses and clock instability that come from operating in a non-real-time environment.
对于大多数服务器端数据处理系统来说,实时保证既不经济也不合适。因此,这些系统必须忍受非实时环境带来的暂停和时钟不稳定性。

Limiting the impact of garbage collection 限制垃圾回收的影响#

Garbage collection used to be one of the biggest reasons for process pauses [79], but fortunately GC algorithms have improved a lot: a properly tuned collector will now usually pause for no more than a few milliseconds. The Java runtime offers collectors such as concurrent mark sweep (CMS), garbage-first (G1), the Z garbage collector (ZGC), Epsilon, and Shenandoah. Each of these is optimized for different memory profiles such as high-frequency object creation, large heaps, and so on. By contrast, Go offers a simpler concurrent mark sweep garbage collector that attempts to optimize itself.
垃圾回收曾经是导致进程暂停的主要原因之一 [79],但幸运的是,GC 算法已经取得了很大进步:经过适当调优的收集器现在通常暂停时间不会超过几毫秒。Java 运行时提供了并发标记清除(CMS)、垃圾优先(G1)、Z 垃圾收集器(ZGC)、Epsilon 和 Shenandoah 等收集器。这些收集器针对不同的内存特征进行了优化,例如高频对象创建、大型堆等。相比之下,Go 提供了一个更简单的并发标记清除垃圾收集器,试图自我优化。

If you need to avoid GC pauses entirely, one option is to use a language that doesn’t have a garbage collector at all. For example, Swift uses automatic reference counting to determine when memory can be freed; Rust and Mojo track lifetimes of objects using the type system so the compiler can determine how long memory must be allocated for.
如果你需要完全避免 GC 暂停,一个选择是使用一种根本没有垃圾回收器的语言。例如,Swift 使用自动引用计数来确定何时可以释放内存;Rust 和 Mojo 通过类型系统跟踪对象的生命周期,以便编译器可以确定需要为内存分配多长时间。

It’s also possible to use a garbage-collected language while mitigating the impact of pauses. One approach is to treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage. If the runtime can warn the application that a node soon requires a GC pause, the application can stop sending new requests to that node, wait for it to finish processing outstanding requests, and then perform the GC while no requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles of the response time [80,81].
也可以在缓解暂停影响的同时使用垃圾回收语言。一种方法是像节点短暂计划停机一样处理 GC 暂停,并在一个节点进行垃圾收集时让其他节点处理来自客户端的请求。如果运行时可以警告应用程序某个节点即将需要 GC 暂停,应用程序可以停止向该节点发送新请求,等待它完成处理未完成的请求,然后在没有请求进行时执行 GC。这个技巧将 GC 暂停对客户端隐藏,并减少了响应时间的高百分位数 [80, 81]。

A variant of this idea is to use the garbage collector only for short-lived objects (which are fast to collect) and to restart processes periodically, before they accumulate enough long-lived objects to require a full GC of long-lived objects [79,82]. One node can be restarted at a time, and traffic can be shifted away from the node before the planned restart, like in a rolling upgrade (see Chapter 5).
这个想法的一个变体是仅使用垃圾收集器处理短期对象(这些对象收集速度快),并定期重新启动进程,以防止它们积累足够多的长期对象而需要进行长期对象的完整 GC [79, 82]。一次可以重新启动一个节点,并在计划重启之前将流量转移到该节点之外,就像滚动升级一样(见第 5 章)。

These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their impact on the application.
这些措施不能完全防止垃圾收集暂停,但它们可以有效地减少对应用程序的影响。

Knowledge, Truth, and Lies 知识、真理和谎言#

So far in this chapter we have explored the ways in which distributed systems are different from programs running on a single computer: there is no shared memory, only message passing via an unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks, and processing pauses.
到目前为止,在本章中我们探讨了分布式系统与在单台计算机上运行的程序的不同之处:没有共享内存,只有通过不可靠的网络进行消息传递,且存在可变延迟,系统可能遭受部分故障、不可靠的时钟和处理暂停。

The consequences of these issues are profoundly disorienting if you’re not used to distributed systems. A node in the network cannot know anything for sure about other nodes—it can only make guesses based on the messages it receives (or doesn’t receive). A node can only find out what state another node is in (what data it has stored, whether it is correctly functioning, etc.) by exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state it is in, because problems in the network cannot reliably be distinguished from problems at a node.
如果你不熟悉分布式系统,这些问题带来的后果会让人深感迷失。网络中的节点无法确切了解其他节点的情况 —— 它只能根据收到的(或未收到)的消息进行猜测。一个节点只能通过与另一个节点交换消息来了解其状态(它存储了哪些数据、是否正常运行等)。如果远程节点没有响应,就无法知道其状态,因为网络中的问题无法可靠地区分节点本身的问题。

Discussions of these systems border on the philosophical: What do we know to be true or false in our system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are unreliable [83]? Should software systems obey the laws that we expect of the physical world, such as cause and effect?
关于这些系统的讨论近乎哲学:在我们的系统中,我们了解哪些是真实的或虚假的?如果我们感知和测量的机制不可靠 [83],我们对此有多大的把握?软件系统是否应该遵循我们期望的物理世界的定律,例如因果关系?

Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.
幸运的是,我们不必去探究生命的意义。在一个分布式系统中,我们可以陈述我们对行为(系统模型)的假设,并设计实际系统使其满足这些假设。算法可以在一定的系统模型内被证明是正确的。这意味着即使底层系统模型提供的保证非常少,可靠的行为也是可以实现的。

However, although it is possible to make software well behaved in an unreliable system model, it is not straightforward to do so. In the rest of this chapter we will further explore the notions of knowledge and truth in distributed systems, which will help us think about the kinds of assumptions we can make and the guarantees we may want to provide. In Chapter 10 we will proceed to look at some examples of distributed algorithms that provide particular guarantees under particular assumptions.
然而,尽管在不可靠的系统模型中可以使软件表现得很好,但这并不容易。在本章的其余部分,我们将进一步探讨分布式系统中的知识和真理的概念,这将帮助我们思考我们可以做出哪些假设以及我们可能希望提供哪些保证。在第 10 章,我们将继续探讨一些分布式算法的例子,这些算法在特定假设下提供特定的保证。

The Majority Rules 多数规则#

Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but any outgoing messages from that node are dropped or delayed [22]. Even though that node is working perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the graveyard, kicking and screaming “I’m not dead!”—but since nobody can hear its screaming, the funeral procession continues with stoic determination.
想象一个存在非对称故障的网络:某个节点能够接收所有发往它的消息,但从此节点发出的任何消息都会被丢弃或延迟 [22]。尽管该节点运行正常,并且正在接收来自其他节点的请求,但其他节点无法收到它的响应。经过一段时间后,其他节点会宣布它宕机,因为它们没有收到该节点的消息。这种情况就像一场噩梦:半断开的节点被拖向墓地,一边挣扎一边喊叫 “我没有死!”—— 但由于没有人能听到它的喊叫,葬礼队伍依然坚定地继续前行。

In a slightly less nightmarish scenario, the semi-disconnected node may notice that the messages it is sending are not being acknowledged by other nodes, and so realize that there must be a fault in the network. Nevertheless, the node is wrongly declared dead by the other nodes, and the semi-disconnected node cannot do anything about it.
在稍微不那么噩梦般的场景中,半断开的节点可能会注意到它发送的消息没有得到其他节点的确认,从而意识到网络中存在故障。然而,该节点会被其他节点错误地宣布为宕机,半断开的节点对此无能为力。

As a third scenario, imagine a node that pauses execution for one minute. During that time, no requests are processed and no responses are sent. The other nodes wait, retry, grow impatient, and eventually declare the node dead and load it onto the hearse. Finally, the pause finishes and the node’s threads continue as if nothing had happened. The other nodes are surprised as the supposedly dead node suddenly raises its head out of the coffin, in full health, and starts cheerfully chatting with bystanders. At first, the paused node doesn’t even realize that an entire minute has passed and that it was declared dead—from its perspective, hardly any time has passed since it was last talking to the other nodes.
作为第三种场景,想象一个节点暂停执行一分钟。在这段时间里,没有处理请求,也没有发送响应。其他节点等待、重试、变得不耐烦,最终宣布该节点死亡并将其装载到灵车上。最后,暂停结束,该节点的线程继续运行,仿佛什么都没发生。其他节点惊讶地发现,那个所谓的死亡节点突然从棺材里抬起头,完好无损,并开始愉快地与旁观者聊天。起初,暂停的节点甚至没有意识到已经过去了一分钟,也没有意识到自己被宣布死亡 —— 从它的角度来看,自上次与其他节点交谈以来,几乎没过去多少时间。

The moral of these stories is that a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum, that is, voting among the nodes (see “Quorums for reading and writing”): decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.
这些故事的寓意是,一个节点不能必然信任自己对情况的判断。分布式系统不能完全依赖单个节点,因为节点可能随时发生故障,可能会使系统卡住而无法恢复。相反,许多分布式算法依赖于多数票,即节点之间的投票(参见 “读写多数票”):决策需要多个节点中的一些最小票数,以减少对任何特定节点的依赖。

That includes decisions about declaring nodes dead. If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive. The individual node must abide by the quorum decision and step down.
这也包括宣布节点死亡的决定。如果一个节点的多数派宣布另一个节点死亡,那么这个节点就必须被视为死亡,即使该节点仍然感觉非常鲜活。单个节点必须遵守多数派的决定并下线。

Most commonly, the quorum is an absolute majority of more than half the nodes (although other kinds of quorums are possible). A majority quorum allows the system to continue working if a minority of nodes are faulty (with three nodes, one faulty node can be tolerated; with five nodes, two faulty nodes can be tolerated). However, it is still safe, because there can only be only one majority in the system—there cannot be two majorities with conflicting decisions at the same time. We will discuss the use of quorums in more detail when we get to consensus algorithms in Chapter 10.
最常见的是,多数票是超过一半节点的绝对多数(尽管可能存在其他类型的多数票)。多数票多数允许系统在少数节点出现故障时继续工作(如果有三个节点,可以容忍一个故障节点;如果有五个节点,可以容忍两个故障节点)。然而,它仍然是安全的,因为系统中只能有一个多数 —— 同时不可能存在两个具有冲突决策的多数。

Distributed Locks and Leases 分布式锁和租约#

Locks and leases in distributed application are prone to be misused, and a common source of bugs [84]. Let’s look at one particular case of how they can go wrong.
分布式应用中的锁和租约容易遭到误用,并且是常见的错误来源 [84]。让我们来看一个它们可能出错的特定案例。

In “Process Pauses” we saw that a lease is a kind of lock that times out and can be assigned to a new owner if the old owner stops responding (perhaps because it crashed, it paused for too long, or it was disconnected from the network). You can use leases in situations where a system requires there to be only one of some thing. For example:
在 “进程暂停” 中,我们看到租约是一种会超时的锁,如果旧所有者停止响应(可能因为它崩溃了,暂停时间过长,或者它断开了网络连接),它可以被分配给新的所有者。你可以在需要系统中只有一个特定事物的场景中使用租约。例如:

  • Only one node is allowed to be the leader for a database shard, to avoid split brain (see “Handling Node Outages”).
    只允许一个节点成为数据库分片的领导者,以避免出现脑裂(参见 “处理节点故障”)。
  • Only one transaction or client is allowed to update a particular resource or object, to prevent it being corrupted by concurrent writes.
    只允许一个事务或客户端更新特定的资源或对象,以防止它被并发写入损坏。
  • Only one node should process a given input file to a big processing job, to avoid wasted effort due to multiple nodes redundantly doing the same work.
    一个节点应该处理一个给定的输入文件到一个大的处理任务中,以避免由于多个节点重复做相同的工作而浪费精力。

It is worth thinking carefully about what happens if several nodes simultaneously believe that they hold the lease, perhaps due to a process pause. In the third example, the consequence is only some wasted computational resources, which is not a big deal. But in the first two cases, the consequence could be lost or corrupted data, which is much more serious.
值得仔细思考如果多个节点同时认为他们持有租约会发生什么,可能由于进程暂停。在第三个例子中,后果只是浪费了一些计算资源,这并不是什么大问题。但在前两个案例中,后果可能是数据丢失或损坏,这要严重得多。

For example, Figure 9-4 shows a data corruption bug due to an incorrect implementation of locking. (The bug is not theoretical: HBase used to have this problem [85,86].) Say you want to ensure that a file in a storage service can only be accessed by one client at a time, because if multiple clients tried to write to it, the file would become corrupted. You try to implement this by requiring a client to obtain a lease from a lock service before accessing the file. Such a lock service is often implemented using a consensus algorithm; we will discuss this further in Chapter 10.
例如,图 9-4 显示了一个由于锁实现不正确导致的数据损坏错误。(这个错误不是理论上的:HBase 曾经有过这个问题 [85,86]。)假设你想确保存储服务中的一个文件一次只能被一个客户端访问,因为如果多个客户端尝试写入它,文件就会损坏。你试图通过要求客户端在访问文件前从锁服务中获取租约来实现这一点。这样的锁服务通常使用共识算法实现;我们将在第 10 章中进一步讨论这一点。

ddia 0904

The problem is an example of what we discussed in “Process Pauses”: if the client holding the lease is paused for too long, its lease expires. Another client can obtain a lease for the same file, and start writing to the file. When the paused client comes back, it believes (incorrectly) that it still has a valid lease and proceeds to also write to the file. We now have a split brain situation: the clients’ writes clash and corrupt the file.
这个问题是我们讨论的 “进程暂停” 的一个例子:如果持有租约的客户端暂停时间过长,其租约会过期。另一个客户端可以获得同一文件的租约,并开始写入该文件。当暂停的客户端恢复时,它错误地认为仍然拥有有效的租约,并继续写入文件。我们现在处于脑裂状态:客户端的写入冲突并损坏了文件。

Figure 9-5 shows a different problem that has similar consequences. In this example there is no process pause, only a crash by client 1. Just before client 1 crashes it sends a write request to the storage service, but this request is delayed for a long time in the network. (Remember from “Network Faults in Practice” that packets can sometimes be delayed by a minute or more.) By the time the write request arrives at the storage service, the lease has already timed out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar to Figure 9-4.
图 9-5 展示了一个具有相似后果的不同问题。在这个例子中,没有进程暂停,只有客户端 1 崩溃。在客户端 1 崩溃之前,它向存储服务发送了一个写请求,但这个请求在网络中延迟了很长时间。(回想一下 “网络故障在实践中” 的内容,数据包有时会延迟一分钟或更长时间。)当写请求到达存储服务时,租约已经超时,允许客户端 2 获取它并发送自己的写操作。结果是类似于图 9-4 的损坏。

ddia 0905

Fencing off zombies and delayed requests 隔离僵尸和延迟请求#

The term zombie is sometimes used to describe a former leaseholder who has not yet found out that it lost the lease, and who is still acting as if it was the current leaseholder. Since we cannot rule out zombies entirely, we have to instead ensure that they can’t do any damage in the form of split brain. This is called fencing off the zombie.
术语 “僵尸” 有时用来描述一个尚未意识到自己失去租约的前任租约持有者,它仍然表现得好像它是当前的租约持有者。由于我们无法完全排除僵尸,我们必须确保它们不会以脑裂的形式造成任何损害。这被称为隔离僵尸。

Some systems attempt to fence off zombies by shutting them down, for example by disconnecting them from the network [9], shutting down the VM via the cloud provider’s management interface, or even physically powering down the machine [87]. This approach is known as Shoot The Other Node In The Head or STONITH. Unfortunately, it suffers from some problems: it does not protect against large network delays like in Figure 9-5; it can happen that all of the nodes shut each other down [19]; and by the time the zombie has been detected and shut down, it may already be too late and data may already have been corrupted.
一些系统试图通过关闭它们来隔离僵尸系统,例如通过断开网络连接 [9]、通过云服务提供商的管理界面关闭虚拟机,甚至物理关闭机器 [87]。这种方法被称为 "射击其他节点的头" 或 STONITH。不幸的是,这种方法存在一些问题:它无法防止像图 9-5 中那样的大网络延迟;可能会发生所有节点互相关闭的情况 [19];而且当检测到僵尸系统并关闭它时,可能已经太晚了,数据可能已经损坏。

A more robust fencing solution, which protects against both zombies and delayed requests, is illustrated in Figure 9-6.
一种更健壮的隔离解决方案,可以同时防止僵尸系统和延迟请求,如图 9-6 所示。

ddia 0906

Let’s assume that every time the lock service grants a lock or lease, it also returns a fencing token, which is a number that increases every time a lock is granted (e.g., incremented by the lock service). We can then require that every time a client sends a write request to the storage service, it must include its current fencing token.
假设每次锁服务授予锁或租用时,它还会返回一个隔离令牌,这是一个每次授予锁时增加的数字(例如,由锁服务递增)。然后我们可以要求每次客户端向存储服务发送写入请求时,都必须包含其当前的隔离令牌。

Note 注意#

There are several alternative names for fencing tokens. In Chubby, Google’s lock service, they are called sequencers [88], and in Kafka they are called epoch numbers. In consensus algorithms, which we will discuss in Chapter 10, the ballot number (Paxos) or term number (Raft) serves a similar purpose.
栅栏令牌有几个替代名称。在 Chubby(Google 的锁服务)中,它们被称为序列号 [88],在 Kafka 中,它们被称为时代号。在我们在第 10 章将讨论的一致性算法中,投票号(Paxos)或任期号(Raft)具有类似的作用。

In Figure 9-6, client 1 acquires the lease with a token of 33, but then it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the number always increases) and then sends its write request to the storage service, including the token of 34. Later, client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage service remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33. A client that has just acquired the lease must immediately make a write to the storage service, and once that write has completed, any zombies are fenced off.
在图 9-6 中,客户端 1 使用令牌 33 获取租约,但随后进入长时间的暂停状态,租约过期。客户端 2 使用令牌 34(编号总是递增)获取租约,然后将其写请求发送给存储服务,包括令牌 34。后来,客户端 1 恢复活动并发送其写请求到存储服务,包括其令牌值 33。然而,存储服务记得它已经处理了一个具有更高令牌编号的写请求(34),因此它拒绝令牌为 33 的请求。刚刚获取租约的客户端必须立即向存储服务发起写操作,一旦该写操作完成,任何僵尸进程都会被隔离。

If ZooKeeper is your lock service, you can use the transaction ID zxid or the node version cversion as fencing token [85]. With etcd, the revision number along with the lease ID serves a similar purpose [89]. The FencedLock API in Hazelcast explicitly generates a fencing token [90].
如果 ZooKeeper 是你的锁服务,你可以使用事务 ID zxid 或节点版本 cversion 作为围栏令牌 [85]。对于 etcd,修订号与租约 ID 共同实现类似功能 [89]。Hazelcast 中的 FencedLock API 明确生成围栏令牌 [90]。

This mechanism requires that the storage service has some way of checking whether a write is based on an outdated token. Alternatively, it’s sufficient for the service to support a write that succeeds only if the object has not been written by another client since the current client last read it, similarly to an atomic compare-and-set (CAS) operation. For example, object storage services support such a check: Amazon S3 calls it conditional writes, Azure Blob Storage calls it conditional headers, and Google Cloud Storage calls it request preconditions.
这种机制要求存储服务能够检查写入是否基于过时的令牌。或者,服务支持仅在对象自当前客户端上次读取以来未被其他客户端写入时才成功的写入,类似于原子比较并设置(CAS)操作。例如,对象存储服务支持此类检查:Amazon S3 称之为条件写入,Azure Blob Storage 称之为条件标头,Google Cloud Storage 称之为请求前提条件。

Fencing with multiple replicas 使用多个副本进行围栏#

If your clients need to write only to one storage service that supports such conditional writes, the lock service is somewhat redundant [91,92], since the lease assignment could have been implemented directly based on that storage service [93]. However, once you have a fencing token you can also use it with multiple services or replicas, and ensure that the old leaseholder is fenced off on all of those services.
如果您的客户端只需要写入一个支持此类条件写入的存储服务,那么锁服务有点多余 [91, 92],因为租约分配可以直接基于该存储服务实现 [93]。然而,一旦您有了隔离令牌,您也可以将其用于多个服务或副本,并确保旧的租约持有者在所有这些服务上都被隔离。

For example, imagine the storage service is a leaderless replicated key-value store with last-write-wins conflict resolution (see “Leaderless Replication”). In such a system, the client sends writes directly to each replica, and each replica independently decides whether to accept a write based on a timestamp assigned by the client.
例如,假设存储服务是一个无主复制键值存储,采用最后写入者胜出的冲突解决机制(参见 “无主复制”)。在这样的系统中,客户端直接将写入发送到每个副本,并且每个副本独立地根据客户端分配的时间戳决定是否接受写入。

As illustrated in Figure 9-7, you can put the writer’s fencing token in the most significant bits or digits of the timestamp. You can then be sure that any timestamp generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even if the old leaseholder’s writes happened later.
如图 9-7 所示,您可以将写入者的隔离令牌放在时间戳的最高位或数字上。然后您可以确信,新租约持有者生成的时间戳将大于旧租约持有者生成的时间戳,即使旧租约持有者的写入发生得较晚。

ddia 0907

In Figure 9-7, Client 2 has a fencing token of 34, so all of its timestamps starting with 34… are greater than any timestamps starting with 33… that are generated by Client 1. Client 2 writes to a quorum of replicas but it can’t reach Replica 3. This means that when the zombie Client 1 later tries to write, its write may succeed at Replica 3 even though it is ignored by replicas 1 and 2. This is not a problem, since a subsequent quorum read will prefer the write from Client 2 with the greater timestamp, and read repair or anti-entropy will eventually overwrite the value written by Client 1.
在图 9-7 中,客户端 2 拥有围栏令牌 34,因此它所有以 34 开头的时戳都大于客户端 1 生成的以 33 开头的任何时戳。客户端 2 写入副本的多数派,但它无法到达副本 3。这意味着当僵尸客户端 1 后来尝试写入时,即使副本 1 和 2 忽略了它,它的写入也可能在副本 3 上成功。这不是问题,因为随后的多数派读取将优先选择具有较大时戳的客户端 2 的写入,并且最终读修复或抗熵将覆盖客户端 1 写入的值。

As you can see from these examples, it is not safe to assume that there is only one node holding a lease at any one time. Fortunately, with a bit of care you can use fencing tokens to prevent zombies and delayed requests from doing any damage.
从这些示例中可以看出,不能假设在任何时候只有一个节点持有租约。幸运的是,稍加小心即可使用围栏令牌来防止僵尸和延迟请求造成任何损害。

Byzantine Faults 拜占庭错误#

Fencing tokens can detect and block a node that is inadvertently acting in error (e.g., because it hasn’t yet found out that its lease has expired). However, if the node deliberately wanted to subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing token.
围栏令牌可以检测并阻止一个节点意外地出错(例如,因为它尚未发现其租约已过期)。然而,如果节点故意想要破坏系统的保证,它可以通过发送带有伪造围栏令牌的消息轻松做到这一点。

In this book we assume that nodes are unreliable but honest: they may be slow or never respond (due to a fault), and their state may be outdated (due to a GC pause or network delays), but we assume that if a node does respond, it is telling the “truth”: to the best of its knowledge, it is playing by the rules of the protocol.
在本书中,我们假设节点是不可靠但诚实的:它们可能缓慢或从不响应(由于故障),其状态可能过时(由于 GC 暂停或网络延迟),但我们假设如果节点做出响应,它是在说 “真话”:据其了解,它是在遵守协议的规则。

Distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses)—for example, it might cast multiple contradictory votes in the same election. Such behavior is known as a Byzantine fault, and the problem of reaching consensus in this untrusting environment is known as the Byzantine Generals Problem [94].
如果存在节点可能 “撒谎”(发送任意错误的或损坏的响应)的风险,分布式系统问题会变得困难得多 —— 例如,在相同的选举中投出多个相互矛盾的选票。这种行为被称为拜占庭错误,而在这种不信任的环境中达成共识的问题被称为拜占庭将军问题 [94]。

A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network. This concern is relevant in certain specific circumstances. For example:
一个系统如果即使在部分节点出现故障且不遵守协议,或者恶意攻击者干扰网络的情况下仍能正常运行,则称其为拜占庭容错的。这种考虑在某些特定情况下是相关的。例如:

  • In aerospace environments, the data in a computer’s memory or CPU register could become corrupted by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board, or a rocket colliding with the International Space Station), flight control systems must tolerate Byzantine faults [98,99].
    在航空航天环境中,计算机内存或 CPU 寄存器中的数据可能会因辐射而损坏,导致其以任意不可预测的方式响应其他节点。由于系统故障会非常昂贵(例如,飞机坠毁并导致机上所有人死亡,或火箭与国际空间站相撞),飞行控制系统必须容忍拜占庭式故障 [98, 99]。
  • In a system with multiple participating parties, some participants may attempt to cheat or defraud others. In such circumstances, it is not safe for a node to simply trust another node’s messages, since they may be sent with malicious intent. For example, cryptocurrencies like Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties to agree whether a transaction happened or not, without relying on a central authority [100].
    在一个有多方参与的系统里,一些参与者可能会试图欺骗或诈骗他人。在这种情况下,一个节点仅仅信任另一个节点的消息是不安全的,因为这些消息可能是出于恶意意图发送的。例如,像比特币这样的加密货币和其他区块链可以被认为是一种让互不信任的各方就交易是否发生达成一致的方法,而无需依赖中央权威 [100]。

However, in the kinds of systems we discuss in this book, we can usually safely assume that there are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a major problem (although datacenters in orbit are being considered [101]). Multitenant systems have mutually untrusting tenants, but they are isolated from each other using firewalls, virtualization, and access control policies, not using Byzantine fault tolerance. Protocols for making systems Byzantine fault-tolerant are quite expensive [102], and fault-tolerant embedded systems rely on support from the hardware level [98]. In most server-side data systems, the cost of deploying Byzantine fault-tolerant solutions makes them impracticable.
然而,在我们讨论的这类系统中,通常可以安全地假设不存在拜占庭式故障。在数据中心中,所有节点都由您的组织控制(因此可以希望它们是值得信任的),辐射水平足够低,以至于内存损坏不是主要问题(尽管正在考虑轨道上的数据中心 [101])。多租户系统中的租户相互不信任,但它们通过防火墙、虚拟化和访问控制策略相互隔离,而不是使用拜占庭式容错。实现系统拜占庭式容错的协议相当昂贵 [102],而容错嵌入式系统依赖于硬件级别的支持 [98]。在大多数服务器端数据系统中,部署拜占庭式容错解决方案的成本使得它们不切实际。

Web applications do need to expect arbitrary and malicious behavior of clients that are under end-user control, such as web browsers. This is why input validation, sanitization, and output escaping are so important: to prevent SQL injection and cross-site scripting, for example. However, we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where there is no such central authority, Byzantine fault tolerance is more relevant [103,104].
Web 应用程序确实需要预期到由最终用户控制且可能存在任意或恶意行为的客户端,例如 Web 浏览器。这就是为什么输入验证、清理和输出转义如此重要:例如,为了防止 SQL 注入和跨站脚本攻击。然而,我们通常不在这里使用拜占庭容错协议,而是简单地让服务器成为决定哪些客户端行为是被允许或不被允许的权威。在点对点网络中,由于没有这样的中央权威,拜占庭容错更为相关 [103, 104]。

A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly (for example, if you have four nodes, at most one may malfunction). To use this approach against bugs, you would have to have four independent implementations of the same software and hope that a bug only appears in one of the four implementations.
软件中的错误可能被视为拜占庭错误,但如果将相同软件部署到所有节点,那么拜占庭容错算法也无法救你。大多数拜占庭容错算法需要超过三分之二的多数节点正常工作(例如,如果你有四个节点,最多只能有一个出现故障)。要使用这种方法来应对错误,你必须要有四个相同软件的独立实现,并希望错误只出现在这四个实现中的一个。

Similarly, it would be appealing if a protocol could protect us from vulnerabilities, security compromises, and malicious attacks. Unfortunately, this is not realistic either: in most systems, if an attacker can compromise one node, they can probably compromise all of them, because they are probably running the same software. Thus, traditional mechanisms (authentication, access control, encryption, firewalls, and so on) continue to be the main protection against attackers.
同样地,如果协议能够保护我们免受漏洞、安全妥协和恶意攻击,那将非常有吸引力。不幸的是,这也不现实:在大多数系统中,如果攻击者能够攻破一个节点,他们很可能能够攻破所有节点,因为它们很可能运行相同的软件。因此,传统机制(认证、访问控制、加密、防火墙等)仍然是抵御攻击者的主要保护手段。

Weak forms of lying 撒谎的弱形式#

Although we assume that nodes are generally honest, it can be worth adding mechanisms to software that guard against weak forms of “lying”—for example, invalid messages due to hardware issues, software bugs, and misconfiguration. Such protection mechanisms are not full-blown Byzantine fault tolerance, as they would not withstand a determined adversary, but they are nevertheless simple and pragmatic steps toward better reliability. For example:
尽管我们假设节点通常是诚实的,但添加一些机制来防范软件中的 “弱形式谎言”—— 例如由于硬件问题、软件错误和配置错误导致的无效消息 —— 可能是有价值的。这些保护机制并非完整的拜占庭容错机制,因为它们无法抵御蓄意的对手,但它们仍然是提高可靠性的简单且务实的步骤。例如:

  • Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems, drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and UDP, but sometimes they evade detection [105,106,107]. Simple measures are usually sufficient protection against such corruption, such as checksums in the application-level protocol. TLS-encrypted connections also offer protection against corruption.
    网络数据包有时会因硬件问题或操作系统、驱动程序、路由器等的错误而损坏。通常,损坏的数据包会被 TCP 和 UDP 内置的校验和捕获,但有时它们会逃避检测 [105, 106, 107]。简单的措施通常足以防范此类损坏,例如应用层协议中的校验和。TLS 加密连接也能提供对损坏的保护。
  • A publicly accessible application must carefully sanitize any inputs from users, for example checking that a value is within a reasonable range and limiting the size of strings to prevent denial of service through large memory allocations. An internal service behind a firewall may be able to get away with less strict checks on inputs, but basic checks in protocol parsers are still a good idea [105].
    一个公开可访问的应用程序必须仔细清理来自用户的任何输入,例如检查值是否在合理范围内,并限制字符串的大小以防止通过大量内存分配导致服务拒绝。一个位于防火墙后面的内部服务可能在输入检查上稍微宽松一些,但在协议解析器中的基本检查仍然是个好主意 [105]。
  • NTP clients can be configured with multiple server addresses. When synchronizing, the client contacts all of them, estimates their errors, and checks that a majority of servers agree on some time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an incorrect time is detected as an outlier and is excluded from synchronization [39]. The use of multiple servers makes NTP more robust than if it only uses a single server.
    NTP 客户端可以配置多个服务器地址。在同步时,客户端会联系所有这些服务器,评估它们的误差,并检查多数服务器是否同意某个时间范围。只要大多数服务器正常,一个配置错误的 NTP 服务器报告了不正确的时间就会被检测为异常值并排除在同步之外 [39]。使用多个服务器使得 NTP 比仅使用单个服务器更加健壮。

System Model and Reality 系统模型与现实#

Many algorithms have been designed to solve distributed systems problems—for example, we will examine solutions for the consensus problem in Chapter 10. In order to be useful, these algorithms need to tolerate the various faults of distributed systems that we discussed in this chapter.
许多算法已被设计用来解决分布式系统问题 —— 例如,我们将在第 10 章中探讨共识问题的解决方案。为了有用,这些算法需要容忍本章讨论的分布式系统的各种故障。

Algorithms need to be written in a way that does not depend too heavily on the details of the hardware and software configuration on which they are run. This in turn requires that we somehow formalize the kinds of faults that we expect to happen in a system. We do this by defining a system model, which is an abstraction that describes what things an algorithm may assume.
算法需要以一种不依赖于运行时硬件和软件配置细节的方式编写。这反过来又要求我们以某种方式形式化系统可能发生的故障类型。我们通过定义系统模型来实现这一点,该模型是一种抽象,描述了算法可能假设的事项。

With regard to timing assumptions, three system models are in common use:
关于时间假设,有三种常见的系统模型:

Synchronous model 同步模型

The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error. This does not imply exactly synchronized clocks or zero network delay; it just means you know that network delay, pauses, and clock drift will never exceed some fixed upper bound [108]. The synchronous model is not a realistic model of most practical systems, because (as discussed in this chapter) unbounded delays and pauses do occur.
同步模型假设有界网络延迟、有界进程暂停和有界时钟误差。这并不意味着时钟完全同步或网络延迟为零;它只是意味着你知道网络延迟、暂停和时钟漂移永远不会超过某个固定的上限 [108]。同步模型并不是大多数实际系统的现实模型,因为(如本章所述)无界延迟和暂停确实会发生。

Partially synchronous model
部分同步模型

Partial synchrony means that a system behaves like a synchronous system most of the time, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift [108]. This is a realistic model of many systems: most of the time, networks and processes are quite well behaved—otherwise we would never be able to get anything done—but we have to reckon with the fact that any timing assumptions may be shattered occasionally. When this happens, network delay, pauses, and clock error may become arbitrarily large.
部分同步意味着系统大部分时间表现得像同步系统,但有时会超过网络延迟、进程暂停和时钟漂移的界限 [108]。这是一个许多系统的现实模型:大部分时间,网络和进程表现相当良好 —— 否则我们永远无法完成任何工作 —— 但我们不得不考虑到任何时间假设都可能偶尔被打破。当这种情况发生时,网络延迟、暂停和时钟误差可能会变得任意大。

Asynchronous model 异步模型

In this model, an algorithm is not allowed to make any timing assumptions—in fact, it does not even have a clock (so it cannot use timeouts). Some algorithms can be designed for the asynchronous model, but it is very restrictive.
在这个模型中,算法不允许做出任何时间假设 —— 事实上,它甚至没有时钟(因此它不能使用超时)。有些算法可以设计为异步模型,但它非常具有限制性。

Moreover, besides timing issues, we have to consider node failures. Some common system models for nodes are:
此外,除了时间问题,我们还需要考虑节点故障。常见的节点系统模型有:

Crash-stop faults 崩溃停止故障

In the crash-stop (or fail-stop) model, an algorithm may assume that a node can fail in only one way, namely by crashing [109]. This means that the node may suddenly stop responding at any moment, and thereafter that node is gone forever—it never comes back.
在崩溃停止(或失效停止)模型中,算法可以假设节点只可能以一种方式失效,即崩溃 [109]。这意味着节点可能在任何时刻突然停止响应,之后该节点就永远消失了 —— 它再也不会回来。

Crash-recovery faults 崩溃恢复故障

We assume that nodes may crash at any moment, and perhaps start responding again after some unknown time. In the crash-recovery model, nodes are assumed to have stable storage (i.e., nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed to be lost.
我们假设节点可能在任何时刻崩溃,并且可能在未知时间后重新开始响应。在崩溃恢复模型中,节点被假定为具有稳定存储(即非易失性磁盘存储),这种存储在崩溃时得以保留,而内存中的状态则被假定为丢失。

Degraded performance and partial functionality
性能下降和部分功能

In addition to crashing and restarting, nodes may go slow: they may still be able to respond to health check requests, while being too slow to get any real work done. For example, a Gigabit network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [110]; a process that is under memory pressure may spend most of its time performing garbage collection [111]; worn-out SSDs can have erratic performance; and hardware can be affected by high temperature, loose connectors, mechanical vibration, power supply problems, firmware bugs, and more [112]. Such a situation is called a limping node, gray failure, or fail-slow [113], and it can be even more difficult to deal with than a cleanly failed node. A related problem is when a process stops doing some of the things it is supposed to do while other aspects continue working, for example because a background thread is crashed or deadlocked [114].
除了崩溃和重启,节点还可能运行缓慢:它们可能仍然能够响应健康检查请求,但速度慢到无法完成任何实际工作。例如,由于驱动程序错误,一个千兆网络接口可能会突然降至 1 Kb / s 的吞吐量 [110];内存压力下的进程可能会将大部分时间用于执行垃圾回收 [111];磨损的 SSD 性能可能不稳定;硬件可能受高温、松动的连接器、机械振动、电源问题、固件错误等因素影响 [112]。这种情况被称为跛行节点、灰色故障或缓慢失效 [113],处理起来可能比干净失败的节点更困难。相关问题还包括进程停止执行其预期部分的工作,而其他方面仍在运行,例如因为后台线程崩溃或死锁 [114]。

Byzantine (arbitrary) faults
拜占庭(任意)故障

Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described in the last section.
节点可以绝对任意地行动,包括试图欺骗和误导其他节点,如上一节所述。

For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model. It allows for unbounded network delay, process pauses, and slow nodes. But how do distributed algorithms cope with that model?
对于建模真实系统,具有崩溃恢复故障的部分同步模型通常是最有用的模型。它允许无界网络延迟、进程暂停和慢节点。但分布式算法如何应对该模型?

Defining the correctness of an algorithm 定义算法的正确性#

To define what it means for an algorithm to be correct, we can describe its properties. For example, the output of a sorting algorithm has the property that for any two distinct elements of the output list, the element further to the left is smaller than the element further to the right. That is simply a formal way of defining what it means for a list to be sorted.
为了定义算法正确性的含义,我们可以描述其属性。例如,排序算法的输出具有这样的属性:对于输出列表中的任意两个不同元素,位于左侧的元素小于位于右侧的元素。这只是一个形式化的方式,用于定义列表排序的含义。

Similarly, we can write down the properties we want of a distributed algorithm to define what it means to be correct. For example, if we are generating fencing tokens for a lock (see “Fencing off zombies and delayed requests”), we may require the algorithm to have the following properties:
类似地,我们可以列出我们希望分布式算法具备的属性,以定义其正确性的含义。例如,如果我们正在为锁生成围栏令牌(参见 “隔离僵尸和延迟请求”),我们可能要求算法具备以下属性:

Uniqueness 唯一性

No two requests for a fencing token return the same value.
请求围栏令牌的两个请求不会返回相同的值。

Monotonic sequence 单调序列

If request x returned token t x, and request y returned token t y, and x completed before y began, then t x  <  t y.
如果请求 x 返回令牌 t x ,请求 y 返回令牌 t y ,并且 x 在 y 开始之前完成,那么 t x < t y

Availability 可用性

A node that requests a fencing token and does not crash eventually receives a response.
请求一个锁令牌且最终不会崩溃的节点会收到响应。

An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model. However, if all nodes crash, or all network delays suddenly become infinitely long, then no algorithm will be able to get anything done. How can we still make useful guarantees even in a system model that allows complete failures?
如果一个算法在所有我们假设可能在该系统模型中发生的所有情况下始终满足其属性,那么该算法在该系统模型中是正确的。然而,如果所有节点都崩溃,或者所有网络延迟突然变得无限长,那么没有任何算法能够完成任何事。即使系统模型允许完全故障,我们仍然如何才能做出有用的保证?

Safety and liveness 安全性和活性#

To clarify the situation, it is worth distinguishing between two different kinds of properties:safety and liveness properties. In the example just given, uniqueness and monotonic sequence are safety properties, but availability is a liveness property.
为了阐明情况,区分两种不同类型的属性是值得的:安全性和活性属性。在刚刚给出的例子中,唯一性和单调序列是安全性属性,但可用性是活性属性。

What distinguishes the two kinds of properties? A giveaway is that liveness properties often include the word “eventually” in their definition. (And yes, you guessed it— eventual consistency is a liveness property [115].)
区分这两种属性的关键在于,活性属性在其定义中通常包含 “最终” 一词。(是的,你猜对了 —— 最终一致性是一种活性属性 [115]。)

Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens. However, it’s best to not read too much into those informal definitions, because “good” and “bad” are value judgements that don’t apply well to algorithms. The actual definitions of safety and liveness are more precise [116]:
安全性通常被非正式定义为 “不会发生坏事”,而活性则定义为 “最终会发生好事”。然而,最好不要对这些非正式定义做过多解读,因为 “好” 和 “坏” 是价值判断,并不适用于算法。安全性和活性的实际定义更为精确 [116]:

  • If a safety property is violated, we can point at a particular point in time at which it was broken (for example, if the uniqueness property was violated, we can identify the particular operation in which a duplicate fencing token was returned). After a safety property has been violated, the violation cannot be undone—the damage is already done.
    如果违反了安全性属性,我们可以指出一个特定的时间点,属性在此刻被破坏(例如,如果唯一性属性被违反,我们可以识别出返回了重复的锁定令牌的具体操作)。一旦违反了安全性属性,这种违反无法撤销 —— 损害已经造成。
  • A liveness property works the other way round: it may not hold at some point in time (for example, a node may have sent a request but not yet received a response), but there is always hope that it may be satisfied in the future (namely by receiving a response).
    活性属性则正好相反:它可能在某个时间点不成立(例如,一个节点可能发送了请求但尚未收到响应),但总存在未来可能满足它的希望(即通过收到响应)。

An advantage of distinguishing between safety and liveness properties is that it helps us deal with difficult system models. For distributed algorithms, it is common to require that safety properties always hold, in all possible situations of a system model [108]. That is, even if all nodes crash, or the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong result (i.e., that the safety properties remain satisfied).
区分安全性和活性属性的好处在于它有助于我们处理复杂的系统模型。对于分布式算法,通常要求安全属性在系统模型的任何可能情况下始终成立 [108]。也就是说,即使所有节点崩溃,或者整个网络失效,算法也必须确保不会返回错误结果(即安全属性仍然得到满足)。

However, with liveness properties we are allowed to make caveats: for example, we could say that a request needs to receive a response only if a majority of nodes have not crashed, and only if the network eventually recovers from an outage. The definition of the partially synchronous model requires that eventually the system returns to a synchronous state—that is, any period of network interruption lasts only for a finite duration and is then repaired.
然而,对于活性属性,我们允许做出一些限制:例如,我们可以规定只有当大多数节点没有崩溃时,请求才需要收到响应,并且只有当网络最终从故障中恢复时才需要。部分同步模型的定义要求系统最终会恢复到同步状态 —— 也就是说,任何网络中断的持续时间都是有限的,并且随后会被修复。

Mapping system models to the real world 将系统模型映射到现实世界#

Safety and liveness properties and system models are very useful for reasoning about the correctness of a distributed algorithm. However, when implementing an algorithm in practice, the messy facts of reality come back to bite you again, and it becomes clear that the system model is a simplified abstraction of reality.
安全性和活性属性以及系统模型对于推理分布式算法的正确性非常有用。然而,在实际实现算法时,现实的混乱事实会再次让你吃亏,并且清楚地表明系统模型是现实的简化抽象。

For example, algorithms in the crash-recovery model generally assume that data in stable storage survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out due to hardware error or misconfiguration [117]? What happens if a server has a firmware bug and fails to recognize its hard drives on reboot, even though the drives are correctly attached to the server [118]?
例如,崩溃恢复模型中的算法通常假设稳定存储中的数据在崩溃后仍然存在。然而,如果磁盘上的数据损坏,或者由于硬件错误或配置错误导致数据被清除 [117] 会怎样?如果服务器存在固件错误,在重启时无法识别其硬盘,即使硬盘已正确连接到服务器 [118] 会怎样?

Quorum algorithms (see “Quorums for reading and writing”) rely on a node remembering the data that it claims to have stored. If a node may suffer from amnesia and forget previously stored data, that breaks the quorum condition, and thus breaks the correctness of the algorithm. Perhaps a new system model is needed, in which we assume that stable storage mostly survives crashes, but may sometimes be lost. But that model then becomes harder to reason about.
仲裁算法(参见 “读写仲裁”)依赖于节点记住其声称存储的数据。如果一个节点可能患有健忘症并忘记之前存储的数据,这将破坏仲裁条件,从而破坏算法的正确性。也许需要一个新的系统模型,其中我们假设稳定存储在大多数情况下能够经受住崩溃,但有时可能会丢失。但这个模型推理起来会更困难。

The theoretical description of an algorithm can declare that certain things are simply assumed not to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can and cannot happen. However, a real implementation may still have to include code to handle the case where something happens that was assumed to be impossible, even if that handling boils down to printf("Sucks to be you") and exit(666) —i.e., letting a human operator clean up the mess [119]. (This is one difference between computer science and software engineering.)
算法的理论描述可以声明某些事情根本不可能发生 —— 在非拜占庭系统中,我们确实需要对可能和不可能发生的故障做出一些假设。然而,实际实现可能仍需包含处理那些被假定为不可能发生的情况的代码,即使这种处理最终归结为 printf("Sucks to be you")exit(666) —— 即,让操作员清理混乱 [119]。(这是计算机科学和软件工程之间的一点区别。)

That is not to say that theoretical, abstract system models are worthless—quite the opposite. They are incredibly helpful for distilling down the complexity of real systems to a manageable set of faults that we can reason about, so that we can understand the problem and try to solve it systematically.
这并不是说理论上的抽象系统模型毫无价值 —— 恰恰相反。它们对于将真实系统的复杂性提炼为我们可以推理的管理范围内的故障集非常有帮助,这样我们就能理解问题并系统地尝试解决它。

Formal Methods and Randomized Testing 形式化方法和随机测试#

How do we know that an algorithm satisfies the required properties? Due to concurrency, partial failures, and network delays there are a huge number of potential states. We need to guarantee that the properties hold in every possible state, and ensure that we haven’t forgotten about any edge cases.
我们如何知道算法满足所需属性?由于并发性、部分故障和网络延迟,存在大量潜在状态。我们需要保证属性在每个可能的状态中都成立,并确保我们没有遗漏任何边缘情况。

One approach is to formally verify an algorithm by describing it mathematically, and using proof techniques to show that it satisfies the required properties in all situations that the system model allows. Proving an algorithm correct does not mean its implementation on a real system will necessarily always behave correctly. But it’s a very good first step, because the theoretical analysis can uncover problems in an algorithm that might remain hidden for a long time in a real system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due to unusual circumstances.
一种方法是通过对算法进行数学描述,并使用证明技术来展示它在系统模型允许的所有情况下都满足所需属性,从而对算法进行形式化验证。证明算法正确并不意味着它在真实系统上的实现一定会始终表现正确。但它是一个非常好的第一步,因为理论分析可以揭示算法中可能长时间隐藏在真实系统中而未被发现的問題,这些问题只有在你的假设(例如关于时序的假设)由于异常情况而被推翻时才会显现出来。

It is prudent to combine theoretical analysis with empirical testing to verify that implementations behave as expected. Techniques such as property-based testing, fuzzing, and deterministic simulation testing (DST) use randomization to test a system in a wide range of situations. Companies such as Amazon Web Services have successfully used a combination of these techniques on many of their products [120,121].
结合理论分析与实证测试来验证实现是否按预期工作是一种明智的做法。属性驱动测试、模糊测试和确定性仿真测试(DST)等技术使用随机化来在各种情况下测试系统。亚马逊网络服务公司等公司已经成功地将这些技术组合起来应用于其许多产品 [120, 121]。

Model checking and specification languages 模型检验与规范语言#

Model checkers are tools that help verify that an algorithm or system behaves as expected. An algorithm specification is written in a purpose-built language such as TLA+, Gallina, or FizzBee. These languages make it easier to focus on an algorithm’s behavior without worrying about code implementation details. Model checkers then use these models to verify that invariants hold across all of an algorithm’s states by systematically trying all the things that could happen.
模型检验器是帮助验证算法或系统是否按预期运行的工具。算法规范是用专门构建的语言编写的,如 TLA+、Gallina 或 FizzBee。这些语言使人们能够专注于算法的行为,而无需担心代码实现的细节。模型检验器随后使用这些模型,通过系统地尝试所有可能发生的事情,来验证算法的所有状态是否都保持不变式。

Model checking can’t actually prove that an algorithm’s invariants hold for every possible state since most real-world algorithms have an infinite state space. A true verification of all states would require a formal proof, which can be done, but which is typically more difficult than running a model checker. Instead, model checkers encourage you to reduce the algorithm’s model to an approximation that can be fully verified, or to limit the execution to some upper bound (for example, by setting a maximum number of messages that can be sent). Any bugs that only occur with longer executions would then not be found.
模型检验实际上无法证明算法的不变式在所有可能的状态下都成立,因为大多数现实世界的算法都有无限的状态空间。对所有状态进行真正验证需要形式化证明,这虽然可以完成,但通常比运行模型检验器更困难。相反,模型检验器鼓励你将算法的模型简化为可以完全验证的近似模型,或者限制执行到某个上限(例如,通过设置可以发送的最大消息数)。那么,只有在更长时间运行中才会出现的任何错误将不会被找到。

Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find and fix bugs [122,123,124]. For example, using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped replication (VR) caused by ambiguity in the prose description of the algorithm [125].
然而,模型检查器在易用性和发现非明显错误的能力之间取得了很好的平衡。CockroachDB、TiDB、Kafka 以及其他许多分布式系统使用模型规范来查找和修复错误 [122, 123, 124]。例如,使用 TLA+,研究人员能够展示由于算法的文本描述中的模糊性导致的视图标记复制(VR)中的数据丢失的潜在可能性 [125]。

By design, model checkers don’t run your actual code, but rather a simplified model that specifies only the core ideas of your protocol. This makes it more tractable to systematically explore the state space, but it risks that your specification and your implementation go out of sync with each other [126]. It is possible to check whether the model and the real implementation have equivalent behavior, but this requires instrumentation in the real implementation [127].
按设计,模型检查器不会运行你的实际代码,而是运行一个简化的模型,该模型仅指定了你的协议的核心思想。这使得系统地探索状态空间变得更加可行,但它也增加了你的规范和实现之间不同步的风险 [126]。可以检查模型和实际实现是否具有等价行为,但这需要在实际实现中进行仪器化 [127]。

Fault injection 故障注入#

Many bugs are triggered when machine and network failures occur. Fault injection is an effective (and sometimes scary) technique that verifies whether a system’s implementation works as expected things go wrong. The idea is simple: inject faults into a running system’s environment and see how it behaves. Faults can be network failures, machine crashes, disk corruption, paused processes—anything you can imagine going wrong with a computer.
许多错误会在机器和网络故障发生时触发。故障注入是一种有效(有时令人恐惧)的技术,用于验证系统在意外情况下的实现是否按预期工作。其原理很简单:向正在运行的系统环境中注入故障,观察其行为。故障可以是网络故障、机器崩溃、磁盘损坏、进程暂停 —— 任何你能想象到的计算机可能出现的故障。

Fault injection tests are typically run in an environment that closely resembles the production environment where the system will run. Some even inject faults directly into their production environment. Netflix popularized this approach with their Chaos Monkey tool [128]. Production fault injection is often referred to as chaos engineering, which we discussed in “Reliability and Fault Tolerance”.
故障注入测试通常在接近系统将运行的生产环境的测试环境中进行。有些人甚至直接在生产环境中注入故障。Netflix 通过他们的混沌猴工具 [128] 推广了这种方法。生产环境中的故障注入通常被称为混沌工程,我们在 “可靠性和容错性” 中讨论过。

To run fault injection tests, the system under test is first deployed along with fault injection coordinators and scripts. Coordinators are responsible for deciding what faults to execute and when to execute them. Local or remote scripts are responsible for injecting failures into individual nodes or processes. Injection scripts use many different tools to trigger faults. A Linux process can be paused or killed using Linux’s kill command, a disk can be unmounted with umount, and network connections can be disrupted through firewall settings. You can inspect system behavior during and after faults are injected to make sure things work as expected.
运行故障注入测试时,首先部署待测系统以及故障注入协调器和脚本。协调器负责决定执行何种故障以及何时执行。本地或远程脚本负责向单个节点或进程注入故障。注入脚本使用多种不同的工具来触发故障。Linux 进程可以使用 Linux 的 kill 命令暂停或终止,磁盘可以通过 umount 卸载,网络连接可以通过防火墙设置中断。你可以检查注入故障期间及之后的系统行为,以确保一切按预期工作。

The myriad of tools required to trigger failures make fault injection tests cumbersome to write. It’s common to adopt a fault injection framework like Jepsen to run fault injection tests to simplify the process. Such frameworks come with integrations for various operating systems and many pre-built fault injectors [129]. Jepsen has been remarkably effective at finding critical bugs in many widely-used systems [130,131].
触发故障所需的众多工具使得故障注入测试编写起来非常繁琐。通常采用像 Jepsen 这样的故障注入框架来运行故障注入测试,以简化流程。此类框架提供了针对各种操作系统的集成以及许多预构建的故障注入器 [129]。Jepsen 在发现许多广泛使用的系统中的关键错误方面表现出色 [130,131]。

Deterministic simulation testing 确定性模拟测试#

Deterministic simulation testing (DST) has also become a popular complement to model-checking and fault injection. It uses a similar state space exploration process as a model checker, but it tests your actual code, not a model.
确定性模拟测试(DST)也已成为模型检查和故障注入的流行补充。它像模型检查器一样使用相似的状态空间探索过程,但它测试的是你的实际代码,而不是模型。

In DST, a simulation automatically runs through a large number of randomised executions of the system. Network communication, I/O, and clock timing during the simulation are all replaced with mocks that allow the simulator to control the exact order in which things happen, including various timings and failure scenarios. This allows the simulator to explore many more situations than hand-written tests or fault injection could. If a test fails, it can be re-run since the simulator knows the exact order of operations that triggered the failure—in contrast to fault injection, which does not have such fine-grained control over the system.
在 DST 中,模拟器会自动运行系统的大量随机化执行。模拟期间的网络通信、I / O 和时钟计时都被替换为模拟对象,允许模拟器控制事件发生的确切顺序,包括各种时间和故障场景。这使得模拟器能够探索比手工测试或故障注入更多的场景。如果测试失败,可以重新运行,因为模拟器知道触发失败的精确操作顺序 —— 这与故障注入形成对比,后者对系统没有如此精细的控制。

DST requires the simulator to be able to control all sources of nondeterminism, such as network delays. One of three strategies is generally adopted to make code deterministic:
DST 要求模拟器能够控制所有非确定性的来源,例如网络延迟。通常采用三种策略之一使代码确定性:

Application-level 应用级

Some systems are built from the ground-up to make it easy to execute code deterministically. For example, FoundationDB, one of the pioneers in the DST space, is built using an asynchronous communication library called Flow. Flow provides a point for developers to inject a deterministic network simulation into the system [132]. Similarly, TigerBeetle is an online transaction processing (OLTP) database with first-class DST support. The system’s state is modeled as a state machine, with all mutations occuring within a single event loop. When combined with mock deterministic primitives such as clocks, such an architecture is able to run deterministically [133].
有些系统从底层开始构建,旨在简化确定性代码的执行。例如,在分布式系统技术(DST)领域的一项先驱技术 FoundationDB,是使用名为 Flow 的异步通信库构建的。Flow 为开发者提供了一个将确定性网络模拟注入系统的接口 [132]。类似地,TigerBeetle 是一个具有一流 DST 支持的开源交易处理(OLTP)数据库。系统的状态被建模为状态机,所有变更都在单个事件循环内发生。当结合使用时钟等模拟确定性原语时,这种架构能够以确定性方式运行 [133]。

Runtime-level 运行时级别

Languages with asynchronous runtimes and commonly used libraries provide an insertion point to introduce determinism. A single-threaded runtime is used to force all asynchronous code to run sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially [134]. Rust’s madsim library works in a similar manner. Madsim provides deterministic implementations of Tokio’s asynchronous runtime API, AWS’s S3 library, Kafka’s Rust library, and many others. Applications can swap in deterministic libraries and runtimes to get deterministic test executions without changing their code.
具有异步运行时和常用库的语言提供了引入确定性的切入点。单线程运行时被用来强制所有异步代码顺序执行。例如,FrostDB 通过修补 Go 的运行时来顺序执行 goroutines [134]。Rust 的 madsim 库以类似的方式工作。Madsim 提供了 Tokio 异步运行时 API、AWS 的 S3 库、Kafka 的 Rust 库以及其他许多库的确定性实现。应用程序可以替换为确定性库和运行时,以获得确定性测试执行,而无需更改其代码。

Machine-level 机器级别

Rather than patching code at runtime, an entire machine can be made deterministic. This is a delicate process that requires a machine to respond to all normally nondeterministic calls with deterministic responses. Tools such as Antithesis do this by building a custom hypervisor that replaces normally nondeterministic operations with deterministic ones. Everything from clocks to network and storage needs to be accounted for. Once done, though, developers can run their entire distributed system in a collection of containers within the hypervisor and get a completely deterministic distributed system.
与其在运行时修补代码,整台机器可以被设计为确定性。这是一个精细的过程,需要机器对所有通常非确定性的调用作出确定性响应。像 Antithesis 这样的工具通过构建一个自定义的虚拟机管理程序,将通常非确定性的操作替换为确定性操作来实现这一点。从时钟到网络和存储,所有方面都需要被考虑进去。一旦完成,开发者就可以在虚拟机管理程序中的一组容器中运行其整个分布式系统,从而得到一个完全确定性的分布式系统。

DST provides several advantages beyond replayability. Tools such as Antithesis attempt to explore many different code paths in application code by branching a test execution into multiple sub-executions when it discovers less common behavior. And because deterministic tests often use mocked clocks and network calls, such tests can run faster than wall-clock time. For example, TigerBeetle’s time abstraction allows simulations to simulate network latency and timeouts without actually taking the full length of time to trigger the timeout. Such techniques allow the simulator to explore more code paths faster.
DST 除了可重放性之外,还提供了多种优势。像 Antithesis 这样的工具试图通过在发现不常见行为时将测试执行分支成多个子执行来探索应用程序代码中的许多不同代码路径。并且由于确定性测试通常使用模拟时钟和网络调用,因此此类测试可以比实际时钟时间运行得更快。例如,TigerBeetle 的时间抽象允许模拟器在不实际花费完整时间来触发超时的情况下模拟网络延迟和超时。这些技术使模拟器能够更快地探索更多代码路径。

Summary 摘要#

In this chapter we have discussed a wide range of problems that can occur in distributed systems, including:
在本章中,我们讨论了分布式系统中可能出现的广泛问题,包括:

  • Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether the message got through.
    每当你试图通过网络发送数据包时,它可能会丢失或被任意延迟。同样,回复也可能丢失或延迟,所以如果你没有收到回复,你不知道消息是否已成功传输。
  • A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you most likely don’t have a good measure of your clock’s confidence interval.
    节点的时钟可能与其他节点严重不同步(尽管你尽力设置了 NTP),它可能会突然向前或向后跳跃时间,依赖它是危险的,因为你很可能没有良好的时钟置信区间测量。
  • A process may pause for a substantial amount of time at any point in its execution, be declared dead by other nodes, and then come back to life again without realizing that it was paused.
    一个进程在其执行过程中的任何一点都可能暂停很长时间,被其他节点宣告死亡,然后又重新复活,而自己却不知道曾被暂停。

The fact that such partial failures can occur is the defining characteristic of distributed systems. Whenever software tries to do anything involving other nodes, there is the possibility that it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.
这种部分故障可能发生的事实是分布式系统的定义特征。每当软件试图做任何涉及其他节点的事情时,它都有可能偶尔失败,或者随机变慢,或者完全无响应(最终超时)。在分布式系统中,我们试图将部分故障的容忍性构建到软件中,以便即使其组成部分中有些损坏,整个系统仍能继续运行。

To tolerate faults, the first step is to detect them, but even that is hard. Most systems don’t have an accurate mechanism of detecting whether a node has failed, so most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts can’t distinguish between network and node failures, and variable network delay sometimes causes a node to be falsely suspected of crashing. Handling limping nodes, which are responding but are too slow to do anything useful, is even harder.
为了容忍故障,第一步是检测它们,但这本身就是一项艰巨的任务。大多数系统都没有准确检测节点是否已失败的手段,因此大多数分布式算法依赖于超时来确定远程节点是否仍然可用。然而,超时无法区分网络故障和节点故障,并且可变的网络延迟有时会导致节点被错误地怀疑崩溃。处理那些响应但反应过慢以至于无法执行有用操作的 “跛脚” 节点,甚至更加困难。

Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge or any other kind of shared state between the machines [83]. Nodes can’t even agree on what time it is, let alone on anything more profound. The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other nodes and try to get a quorum to agree.
一旦检测到故障,让系统容忍它也并不容易:没有全局变量,没有共享内存,没有共同知识或任何其他类型的共享状态在机器之间 [83]。节点甚至无法就当前时间达成一致,更不用说就任何更深刻的问题达成一致了。信息从一台节点流向另一台节点的唯一方式是通过不可靠的网络发送。重大决策不能由单个节点安全地做出,因此我们需要那些请求其他节点帮助并试图获得多数同意的协议。

If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems engineers will often regard a problem as trivial if it can be solved on a single computer [4], and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and simply keep things on a single machine, for example by using an embedded storage engine (see “Embedded storage engines”), it is generally worth doing so.
如果你习惯于在单台计算机的理想化数学完美环境中编写软件,在那里相同的操作总是确定性地返回相同的结果,那么转向分布式系统的混乱物理现实可能会有些令人震惊。相反,分布式系统工程师如果能在单台计算机上解决问题,往往会认为这个问题微不足道 [4],事实上,单台计算机如今能做的事情很多。如果你能避免打开潘多拉的魔盒,并简单地保持事物在单台机器上,例如通过使用嵌入式存储引擎(参见 “嵌入式存储引擎”),那么通常值得这样做。

However, as discussed in “Distributed versus Single-Node Systems”, scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things cannot be achieved with a single node. The power of distributed systems is that in principle, they can run forever without being interrupted at the service level, because all faults and maintenance can be handled at the node level. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring a distributed system to its knees.)
然而,正如在 “分布式系统与单节点系统” 章节中讨论的,可扩展性并非使用分布式系统的唯一原因。容错性和低延迟(通过将数据放置在靠近用户的位置)同样重要的目标,而这些目标无法通过单个节点实现。分布式系统的强大之处在于,原则上它们可以在服务级别上永远运行而不会中断,因为所有故障和维护都可以在节点级别处理。(在实践中,如果不良的配置更改被推送到所有节点,这仍然会使分布式系统崩溃。)

In this chapter we also went on some tangents to explore whether the unreliability of networks, clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expensive and reliable.
在本章中,我们还偏离主题探讨了网络、时钟和进程的不可靠性是否是自然法则。我们看到这并非如此:在网络中提供硬实时响应保证和有界延迟是可能的,但这样做非常昂贵,导致硬件资源利用率降低。大多数非安全关键系统选择廉价且不可靠而非昂贵且可靠。

This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we will move on to solutions, and discuss some algorithms that have been designed to cope with the problems in distributed systems.
本章一直都在讨论问题,并让我们看到了黯淡的前景。在下一章中,我们将转向解决方案,讨论一些为应对分布式系统中的问题而设计的算法。

Footnotes 脚注#
References 参考文献#

[1] Mark Cavage.There’s Just No Getting Around It: You’re Building a Distributed System. ACM Queue, volume 11, issue 4, pages 80-89, April 2013.doi.1145/2466486.2482856
[1] Mark Cavage. 没有办法:你正在构建一个分布式系统。ACM Queue,第 11 卷,第 4 期,第 80-89 页,2013 年 4 月。doi.1145/2466486.2482856

[2] Jay Kreps.Getting Real About Distributed System Reliability. blog.empathybox.com, March 2012. Archived at perma.cc/9B5Q-AEBW
[2] Jay Kreps. 关于分布式系统可靠性的现实思考。blog.empathybox.com,2012 年 3 月。存档于 perma.cc / 9B5Q - AEBW

[3] Coda Hale.You Can’t Sacrifice Partition Tolerance. codahale.com, October 2010.https://perma.cc/6GJU-X4G5
[3] Coda Hale. 你不能牺牲分区容错性。codahale.com,2010 年 10 月。https://perma.cc/6GJU-X4G5

[4] Jeff Hodges.Notes on Distributed Systems for Young Bloods. somethingsimilar.com, January 2013. Archived at perma.cc/B636-62CE
[4] Jeff Hodges. 面向年轻人的分布式系统笔记。somethingsimilar.com,2013 年 1 月。存档于 perma.cc/B636-62CE

[5] Van Jacobson.Congestion Avoidance and Control. At ACM Symposium on Communications Architectures and Protocols (SIGCOMM), August 1988.doi.1145/52324.52356
[5] Van Jacobson. Congestion Avoidance and Control. 在 ACM 通信架构与协议研讨会(SIGCOMM)上,1988 年 8 月。doi.1145/52324.52356

[6] Bert Hubert.The Ultimate SO_LINGER Page, or: Why Is My TCP Not Reliable. blog.netherlabs.nl, January 2009. Archived at perma.cc/6HDX-L2RR
[6] Bert Hubert. The Ultimate SO_LINGER Page, or: Why Is My TCP Not Reliable. blog.netherlabs.nl, 2009 年 1 月。在 perma.cc / 6HDX - L2RR 存档

[7] Jerome H. Saltzer, David P. Reed, and David D. Clark.End-To-End Arguments in System Design. ACM Transactions on Computer Systems, volume 2, issue 4, pages 277–288, November 1984.doi.1145/357401.357402
[7] Jerome H. Saltzer, David P. Reed, and David D. Clark. End-To-End Arguments in System Design. ACM 计算机系统学报,第 2 卷,第 4 期,第 277–288 页,1984 年 11 月。doi.1145/357401.357402

[8] Peter Bailis and Kyle Kingsbury.The Network Is Reliable.ACM Queue, volume 12, issue 7, pages 48-55, July 2014.doi.1145/2639988.2639988
[8] Peter Bailis and Kyle Kingsbury. The Network Is Reliable. ACM Queue,第 12 卷,第 7 期,第 48-55 页,2014 年 7 月。doi.1145/2639988.2639988

[9] Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish.Taming Uncertainty in Distributed Systems with Help from the Network. At 10th European Conference on Computer Systems (EuroSys), April 2015.doi.1145/2741948.2741976
[9] Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera 和 Michael Walfish. 借助网络之力驯服分布式系统中的不确定性. 在第 10 届欧洲计算机系统会议(EuroSys)上,2015 年 4 月. doi.1145/2741948.2741976

[10] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan.Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. At ACM SIGCOMM Conference, August 2011.doi.1145/2018436.2018477
[10] Phillipa Gill, Navendu Jain 和 Nachiappan Nagappan. 理解数据中心中的网络故障:测量、分析与影响. 在 ACM SIGCOMM 会议上,2011 年 8 月. doi.1145/2018436.2018477

[11] Urs Hölzle.But recently a farmer had started grazing a herd of cows nearby. And whenever they stepped on the fiber link, they bent it enough to cause a blip. x.com, May 2020. Archived at perma.cc/WX8X-ZZA5
[11] Urs Hölzle. 但最近附近有位农民开始放牧一群牛,每当它们踩到光纤链接上时,就会弯曲它到足以引起信号干扰的程度. x.com, 2020 年 5 月. 存档于 perma.cc / WX8X - ZZA5

[12] CBC News.Hundreds lose internet service in northern B.C. after beaver chews through cable. cbc.ca, April 2021. Archived at perma.cc/UW8C-H2MY
[12] CBC News. 加拿大不列颠哥伦比亚省北部数百人因河狸啃咬电缆而失去互联网服务. cbc.ca, 2021 年 4 月. 存档于 perma.cc / UW8C - H2MY

[13] Will Oremus.The Global Internet Is Being Attacked by Sharks, Google Confirms. slate.com, August 2014. Archived at perma.cc/P6F3-C6YG
[13] Will Oremus. 全球互联网正遭受鲨鱼攻击,谷歌证实。slate.com,2014 年 8 月。存档于 perma.cc / P6F3 - C6YG

[14] Jess Auerbach Jahajeeah.Down to the wire: The ship fixing our internet. continent.substack.com, November 2023. Archived at perma.cc/DP7B-EQ7S
[14] Jess Auerbach Jahajeeah. 紧急修复:修复我们互联网的船只。continent.substack.com,2023 年 11 月。存档于 perma.cc / DP7B - EQ7S

[15] Santosh Janardhan.More details about the October 4 outage. engineering.fb.com, October 2021. Archived at perma.cc/WW89-VSXH
[15] Santosh Janardhan. 10 月 4 日故障的更多细节。engineering.fb.com,2021 年 10 月。存档于 perma.cc/WW89-VSXH

[16] Tom Parfitt.Georgian woman cuts off web access to whole of Armenia. theguardian.com, April 2011. Archived at perma.cc/KMC3-N3NZ
[16] Tom Parfitt. 格鲁吉亚妇女切断亚美尼亚全国的网络访问。theguardian.com,2011 年 4 月。存档于 perma.cc/KMC3-N3NZ

[17] Antonio Voce, Tural Ahmedzade and Ashley Kirk.‘Shadow fleets’ and subaquatic sabotage: are Europe’s undersea internet cables under attack?theguardian.com, March 2025. Archived at perma.cc/HA7S-ZDBV
[17] 安东尼奥・沃塞,图尔・阿赫梅兹泽德和阿什利・柯克. “影子舰队” 与水下破坏:欧洲的海底互联网电缆是否正遭受攻击?theguardian.com,2025 年 3 月。存档于 perma.cc / HA7S - ZDBV

[18] Shengyun Liu, Paolo Viotti, Christian Cachin, Vivien Quéma, and Marko Vukolić.XFT: Practical Fault Tolerance beyond Crashes. At 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 2016.
[18] 刘胜云,保罗・维奥蒂,克里斯蒂安・卡钦,薇薇安・凯马和马克奥・武科利奇. XFT:崩溃之外的实用容错性。在第 12 届 USENIX 操作系统设计与实现研讨会(OSDI),2016 年 11 月。

[19] Mark Imbriaco.Downtime last Saturday.github.blog, December 2012. Archived at perma.cc/M7X5-E8SQ
[19] 马克・伊姆布里亚科. 上周六的停机时间。github.blog,2012 年 12 月。存档于 perma.cc / M7X5 - E8SQ

[20] Tom Lianza and Chris Snook.A Byzantine failure in the real world. blog.cloudflare.com, November 2020. Archived at perma.cc/83EZ-ALCY
[20] 汤姆・里亚查和克里斯・斯诺克. 真实世界中的拜占庭式失败。blog.cloudflare.com,2020 年 11 月。存档于 perma.cc / 83EZ - ALCY

[21] Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, and Samer Al-Kiswany.Toward a Generic Fault Tolerance Technique for Partial Network Partitioning. At 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 2020.
[21] Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan 和 Samer Al-Kiswany. 面向部分网络分区通用容错技术。第 14 届 USENIX 操作系统设计与实现研讨会(OSDI),2020 年 11 月。

[22] Marc A. Donges.Re: bnx2 cards Intermittantly Going Offline. Message to Linux netdev mailing list, spinics.net, September 2012. Archived at perma.cc/TXP6-H8R3
[22] Marc A. Donges. Re: bnx2 cards Intermittently Going Offline. Linux netdev 邮件列表消息,spinics.net,2012 年 9 月。存档于 perma.cc/TXP6-H8R3

[23] Troy Toman.Inside a CODE RED: Network Edition. signalvnoise.com, September 2020. Archived at perma.cc/BET6-FY25
[23] Troy Toman. 代码红色:网络版内幕。signalvnoise.com,2020 年 9 月。存档于 perma.cc/BET6-FY25

[24] Kyle Kingsbury.Call Me Maybe: Elasticsearch. aphyr.com, June 2014.perma.cc/JK47-S89J
[24] Kyle Kingsbury. 给我打电话好吗:Elasticsearch。aphyr.com,2014 年 6 月。perma.cc/JK47-S89J

[25] Salvatore Sanfilippo.A Few Arguments About Redis Sentinel Properties and Fail Scenarios. antirez.com, October 2014.perma.cc/8XEU-CLM8
[25] Salvatore Sanfilippo. 关于 Redis Sentinel 属性和故障场景的几点讨论. antirez.com, 2014 年 10 月. perma.cc / 8XEU - CLM8

[26] Nicolas Liochon.CAP: If All You Have Is a Timeout, Everything Looks Like a Partition. blog.thislongrun.com, May 2015. Archived at perma.cc/FS57-V2PZ
[26] Nicolas Liochon. CAP: 如果你只有超时,所有东西都像分区. blog.thislongrun.com, 2015 年 5 月. 存档于 perma.cc/FS57-V2PZ

[27] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft.Queues Don’t Matter When You Can JUMP Them! At 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.
[27] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, 和 Jon Crowcroft. 当你可以跳过队列时,队列就不重要了!在第 12 届 USENIX 网络系统设计与实现 symposium (NSDI), 2015 年 5 月.

[28] Theo Julienne.Debugging network stalls on Kubernetes. github.blog, November 2019. Archived at perma.cc/K9M8-XVGL
[28] Theo Julienne. 在 Kubernetes 上调试网络停滞. github.blog, 2019 年 11 月. 存档于 perma.cc / K9M8 - XVGL

[29] Guohui Wang and T. S. Eugene Ng.The Impact of Virtualization on Network Performance of Amazon EC2 Data Center. At 29th IEEE International Conference on Computer Communications (INFOCOM), March 2010.doi.1109/INFCOM.2010.5461931
[29] 王国辉和 T. S. 埃德温・杨. 虚拟化对 Amazon EC2 数据中心网络性能的影响. 在第 29 届 IEEE 国际计算机通信会议(INFOCOM),2010 年 3 月. doi.1109/INFCOM.2010.5461931

[30] Brandon Philips.etcd: Distributed Locking and Service Discovery. At Strange Loop, September 2014.
[30] 布兰登・菲利普斯. etcd:分布式锁和服务的发现. 在 Strange Loop,2014 年 9 月.

[31] Steve Newman.A Systematic Look at EC2 I/O.blog.scalyr.com, October 2012. Archived at perma.cc/FL4R-H2VE
[31] 史蒂夫・纽曼. 系统地看待 EC2 I/O. blog.scalyr.com,2012 年 10 月. 归档于 perma.cc/FL4R - H2VE

[32] Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama. The ϕ Accrual Failure Detector. Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004. Archived at perma.cc/NSM2-TRYA
[32] 早川直弘、夏尔・德法戈、拉米・亚拉德和高田健也. ϕ 累积故障检测器. 日本先进科学技术学院信息科学学院,技术报告 IS-RR-2004-010,2004 年 5 月. 归档于 perma.cc/NSM2-TRYA

[33] Jeffrey Wang.Phi Accrual Failure Detector. ternarysearch.blogspot.co.uk, August 2013.perma.cc/L452-AMLV
[33] Jeffrey Wang. Phi Accrual Failure Detector. ternarysearch.blogspot.co.uk, 2013 年 8 月. perma.cc/L452-AMLV

[34] Srinivasan Keshav. An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network. Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6
[34] Srinivasan Keshav. An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network. Addison-Wesley Professional, 1997 年 5 月. ISBN: 978-0-201-63442-6

[35] Othmar Kyas. ATM Networks. International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6

[36] Mellanox Technologies.InfiniBand FAQ, Rev 1.3. network.nvidia.com, December 2014. Archived at perma.cc/LQJ4-QZVK
[36] Mellanox Technologies. InfiniBand FAQ, Rev 1.3. network.nvidia.com, 2014 年 12 月. Archived at perma.cc/LQJ4-QZVK

[37] Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman.End-to-End Congestion Control for InfiniBand. At 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), April 2003. Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359.doi.1109/INFCOM.2003.1208949
[37] Jose Renato Santos、Yoshio Turner 和 G.(John) Janakiraman。面向 InfiniBand 的端到端拥塞控制。在 IEEE 计算机与通信协会联合年会(INFOCOM)第 22 届会议,2003 年 4 月。同时发表于 HP 实验室帕洛阿尔托,技术报告 HPL-2002-359。doi.1109/INFCOM.2003.1208949

[38] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble.Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. At ACM Symposium on Cloud Computing (SOCC), November 2014.doi.1145/2670979.2670988
[38] Jialin Li、Naveen Kr. Sharma、Dan R. K. Ports 和 Steven D. Gribble。尾巴的故事:硬件、操作系统和应用层尾延迟的来源。在 ACM 云计算研讨会(SOCC),2014 年 11 月。doi.1145/2670979.2670988

[39] Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley.The NTP FAQ and HOWTO. ntp.org, November 2006.
[39] Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley. 《NTP 常见问题解答和操作指南》. ntp.org, 2006 年 11 月.

[40] John Graham-Cumming.How and why the leap second affected Cloudflare DNS. blog.cloudflare.com, January 2017. Archived at archive.org
[40] John Graham-Cumming. 《闰秒如何影响 Cloudflare DNS 及其原因》. blog.cloudflare.com, 2017 年 1 月. 已存档于 archive.org

[41] David Holmes.Inside the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows. blogs.oracle.com, October 2006. Archived at archive.org
[41] David Holmes. 《热点虚拟机内部:时钟、计时器和调度事件 —— 第一部分:Windows》. blogs.oracle.com, 2006 年 10 月. 已存档于 archive.org

[42] Joran Dirk Greef.Three Clocks are Better than One. tigerbeetle.com, August 2021. Archived at perma.cc/5RXG-EU6B
[42] Joran Dirk Greef. 《三个时钟胜过一个》. tigerbeetle.com, 2021 年 8 月. 已存档于 perma.cc / 5RXG - EU6B

[43] Oliver Yang.Pitfalls of TSC usage.oliveryang.net, September 2015. Archived at perma.cc/Z2QY-5FRA
[43] Oliver Yang. TSC 使用的陷阱。oliveryang.net,2015 年 9 月。存档于 perma.cc / Z2QY - 5FRA

[44] Steve Loughran.Time on Multi-Core, Multi-Socket Servers. steveloughran.blogspot.co.uk, September 2015. Archived at perma.cc/7M4S-D4U6
[44] Steve Loughran. 多核、多插槽服务器上的时间。steveloughran.blogspot.co.uk,2015 年 9 月。存档于 perma.cc / 7M4S - D4U6

[45] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang.Spanner: Google’s Globally-Distributed Database. At 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.
[45] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, 和 Ruth Wang. Spanner:Google 的全球分布式数据库。在第十届 USENIX 操作系统设计与实现 symposium (OSDI),2012 年 10 月。

[46] M. Caporaloni and R. Ambrosini.How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet?European Journal of Physics, volume 23, issue 4, pages L17–L21, June 2012.doi.1088/0143-0807/23/4/103
[46] M. Caporaloni 和 R. Ambrosini. 个人计算机时钟能通过互联网多紧密地跟踪 UTC 时间尺度?欧洲物理杂志,第 23 卷,第 4 期,第 L17–L21 页,2012 年 6 月。doi.1088/0143-0807/23/4/103

[47] Nelson Minar.A Survey of the NTP Network.alumni.media.mit.edu, December 1999. Archived at perma.cc/EV76-7ZV3
[47] Nelson Minar. NTP 网络的综述. alumni.media.mit.edu, 1999 年 12 月. 永久存档于 perma.cc/EV76-7ZV3

[48] Viliam Holub.Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem. blog.rapid7.com, March 2014. Archived at perma.cc/N3RV-5LNL
[48] Viliam Holub. Cassandra 集群中的时钟同步 Pt. 1 – 问题. blog.rapid7.com, 2014 年 3 月. 永久存档于 perma.cc / N3RV - 5LNL

[49] Poul-Henning Kamp.The One-Second War (What Time Will You Die?) ACM Queue, volume 9, issue 4, pages 44–48, April 2011.doi.1145/1966989.1967009
[49] Poul-Henning Kamp. 一秒之战争(你何时会死?). ACM Queue, 第 9 卷, 第 4 期, 第 44-48 页, 2011 年 4 月. doi.1145/1966989.1967009

[50] Nelson Minar.Leap Second Crashes Half the Internet. somebits.com, July 2012. Archived at perma.cc/2WB8-D6EU
[50] Nelson Minar. 阳历秒导致半个互联网崩溃. somebits.com, 2012 年 7 月. 永久存档于 perma.cc / 2WB8 - D6EU

[51] Christopher Pascoe.Time, Technology and Leaping Seconds. googleblog.blogspot.co.uk, September 2011. Archived at perma.cc/U2JL-7E74
[51] Christopher Pascoe. 时间、技术与闰秒跳跃. googleblog.blogspot.co.uk, 2011 年 9 月. 永久存档于 perma.cc / U2JL - 7E74

[52] Mingxue Zhao and Jeff Barr.Look Before You Leap – The Coming Leap Second and AWS. aws.amazon.com, May 2015. Archived at perma.cc/KPE9-XMFM
[52] Mingxue Zhao 和 Jeff Barr. 谨慎跳跃 —— 即将到来的闰秒与 AWS. aws.amazon.com, 2015 年 5 月. 永久存档于 perma.cc/KPE9-XMFM

[53] Darryl Veitch and Kanthaiah Vijayalayan.Network Timing and the 2015 Leap Second. At 17th International Conference on Passive and Active Measurement (PAM), April 2016.doi.1007/978-3-319-30505-9_29
[53] Darryl Veitch 和 Kanthaiah Vijayalayan. 网络计时与 2015 年闰秒. 在第 17 届被动与主动测量国际会议 (PAM), 2016 年 4 月. doi.1007/978-3-319-30505-9_29

[54] VMware, Inc.Timekeeping in VMware Virtual Machines. vmware.com, October 2008. Archived at perma.cc/HM5R-T5NF
[54] VMware, Inc. VMware 虚拟机中的计时. vmware.com, 2008 年 10 月. 永久存档于 perma.cc / HM5R - T5NF

[55] Victor Yodaiken.Clock Synchronization in Finance and Beyond. yodaiken.com, November 2017. Archived at perma.cc/9XZD-8ZZN
[55] Victor Yodaiken. 金融及其他领域的时钟同步. yodaiken.com, 2017 年 11 月. 存档于 perma.cc / 9XZD - 8ZZN

[56] Mustafa Emre Acer, Emily Stark, Adrienne Porter Felt, Sascha Fahl, Radhika Bhargava, Bhanu Dev, Matt Braithwaite, Ryan Sleevi, and Parisa Tabriz.Where the Wild Warnings Are: Root Causes of Chrome HTTPS Certificate Errors. At ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1407–1420, October 2017.doi.1145/3133956.3134007
[56] Mustafa Emre Acer, Emily Stark, Adrienne Porter Felt, Sascha Fahl, Radhika Bhargava, Bhanu Dev, Matt Braithwaite, Ryan Sleevi, 和 Parisa Tabriz. 野警告在哪里:Chrome HTTPS 证书错误的根本原因. 在 ACM SIGSAC 计算机与通信安全会议 (CCS), 第 1407-1420 页, 2017 年 10 月. doi.1145/3133956.3134007

[57] European Securities and Markets Authority.MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I.esma.europa.eu, Report ESMA/2015/1464, September 2015. Archived at perma.cc/ZLX9-FGQ3
[57] 欧洲证券和市场管理局. MiFID II / MiFIR:监管技术及实施标准 – 附件 I. esma.europa.eu, 报告 ESMA/2015/1464, 2015 年 9 月. 存档于 perma.cc/ZLX9-FGQ3

[58] Luke Bigum.Solving MiFID II Clock Synchronisation With Minimum Spend (Part 1). catach.blogspot.com, November 2015. Archived at perma.cc/4J5W-FNM4
[58] Luke Bigum. 以最低花费解决 MiFID II 时钟同步 (第 1 部分). catach.blogspot.com, 2015 年 11 月. 存档于 perma.cc / 4J5W - FNM4

[59] Oleg Obleukhov and Ahmad Byagowi.How Precision Time Protocol is being deployed at Meta. engineering.fb.com, November 2022. Archived at perma.cc/29G6-UJNW
[59] Oleg Obleukhov 和 Ahmad Byagowi. 精度时间协议在 Meta 的部署情况。engineering.fb.com,2022 年 11 月。存档于 perma.cc / 29G6 - UJNW

[60] John Wiseman.gpsjam.org, July 2022.
[60] John Wiseman. gpsjam.org,2022 年 7 月。

[61] Josh Levinson, Julien Ridoux, and Chris Munns.It’s About Time: Microsecond-Accurate Clocks on Amazon EC2 Instances. aws.amazon.com, November 2023. Archived at perma.cc/56M6-5VMZ
[61] Josh Levinson、Julien Ridoux 和 Chris Munns。时间就是一切:亚马逊 EC2 实例上的微秒级时钟。aws.amazon.com,2023 年 11 月。存档于 perma.cc / 56M6 - 5VMZ

[62] Kyle Kingsbury.Call Me Maybe: Cassandra.aphyr.com, September 2013. Archived at perma.cc/4MBR-J96V
[62] Kyle Kingsbury。给我打电话:Cassandra。aphyr.com,2013 年 9 月。存档于 perma.cc / 4MBR - J96V

[63] John Daily.Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems. riak.com, November 2013. Archived at perma.cc/4XB5-UCXY
[63] John Daily. 时钟是坏东西,或,欢迎来到分布式系统的奇妙世界。riak.com, 2013 年 11 月。存档于 perma.cc / 4XB5 - UCXY

[64] Marc Brooker.It’s About Time!brooker.co.za, November 2023. Archived at perma.cc/N6YK-DRPA
[64] Marc Brooker. 时间就是一切!brooker.co.za, 2023 年 11 月。存档于 perma.cc / N6YK - DRPA

[66] Leslie Lamport.Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, volume 21, issue 7, pages 558–565, July 1978.doi.1145/359545.359563
[66] Leslie Lamport. 分布式系统中的时间、时钟和事件排序。ACM 通讯,第 21 卷,第 7 期,第 558-565 页,1978 年 7 月。doi.1145/359545.359563

[67] Justin Sheehy.There Is No Now: Problems With Simultaneity in Distributed Systems. ACM Queue, volume 13, issue 3, pages 36–41, March 2015.doi.1145/2733108
[67] Justin Sheehy. 《没有现在:分布式系统中的同时性问题》. ACM Queue, 第 13 卷, 第 3 期, 第 36-41 页, 2015 年 3 月. doi.1145/2733108

[68] Murat Demirbas.Spanner: Google’s Globally-Distributed Database. muratbuffalo.blogspot.co.uk, July 2013. Archived at perma.cc/6VWR-C9WB
[68] Murat Demirbas. Spanner:Google 的全球分布式数据库. muratbuffalo.blogspot.co.uk, 2013 年 7 月. 永久存档于 perma.cc / 6VWR - C9WB

[69] Dahlia Malkhi and Jean-Philippe Martin.Spanner’s Concurrency Control. ACM SIGACT News, volume 44, issue 3, pages 73–77, September 2013.doi.1145/2527748.2527767
[69] Dahlia Malkhi 和 Jean-Philippe Martin. Spanner 的并发控制. ACM SIGACT News, 第 44 卷, 第 3 期, 第 73-77 页, 2013 年 9 月. doi.1145/2527748.2527767

[70] Franck Pachot.Achieving Precise Clock Synchronization on AWS. yugabyte.com, December 2024. Archived at perma.cc/UYM6-RNBS
[70] Franck Pachot. 在 AWS 上实现精确时钟同步. yugabyte.com, 2024 年 12 月. 归档于 perma.cc/UYM6-RNBS

[71] Spencer Kimball.Living Without Atomic Clocks: Where CockroachDB and Spanner diverge. cockroachlabs.com, January 2022. Archived at perma.cc/AWZ7-RXFT
[71] Spencer Kimball. 没有原子钟的生活:CockroachDB 和 Spanner 的分歧. cockroachlabs.com, 2022 年 1 月. 归档于 perma.cc/AWZ7-RXFT

[72] Murat Demirbas.Use of Time in Distributed Databases (part 4): Synchronized clocks in production databases.muratbuffalo.blogspot.com, January 2025. Archived at perma.cc/9WNX-Q9U3
[72] Murat Demirbas. 分布式数据库中的时间使用(第 4 部分):生产数据库中的同步时钟. muratbuffalo.blogspot.com, 2025 年 1 月. 归档于 perma.cc / 9WNX - Q9U3

[73] Cary G. Gray and David R. Cheriton.Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency. At 12th ACM Symposium on Operating Systems Principles (SOSP), December 1989.doi.1145/74850.74870
[73] Cary G. Gray 和 David R. Cheriton. 租约:一种高效的容错机制,用于分布式文件缓存一致性. 在第 12 届 ACM 操作系统原理研讨会(SOSP),1989 年 12 月. doi.1145/74850.74870

[74] Daniel Sturman, Scott Delap, Max Ross, et al.Roblox Return to Service. corp.roblox.com, January 2022. Archived at perma.cc/8ALT-WAS4
[74] Daniel Sturman、Scott Delap、Max Ross 等。《Roblox Return to Service》。corp.roblox.com,2022 年 1 月。存档于 perma.cc / 8ALT - WAS4

[75] Todd Lipcon.Avoiding Full GCs with MemStore-Local Allocation Buffers. slideshare.net, February 2011. Archived at https://perma.cc/CH62-2EWJ
[75] Todd Lipcon。《Avoiding Full GCs with MemStore-Local Allocation Buffers》。slideshare.net,2011 年 2 月。存档于 https://perma.cc/CH62-2EWJ

[76] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield.Live Migration of Virtual Machines. At 2nd USENIX Symposium on Symposium on Networked Systems Design & Implementation (NSDI), May 2005.
[76] Christopher Clark、Keir Fraser、Steven Hand、Jacob Gorm Hansen、Eric Jul、Christian Limpach、Ian Pratt 和 Andrew Warfield。《虚拟机的实时迁移》。在第二届 USENIX 网络系统设计与实现研讨会(NSDI),2005 年 5 月。

[77] Mike Shaver.fsyncers and Curveballs. shaver.off.net, May 2008. Archived at archive.org
[77] Mike Shaver。《fsyncers and Curveballs》。shaver.off.net,2008 年 5 月。存档于 archive.org

[78] Zhenyun Zhuang and Cuong Tran.Eliminating Large JVM GC Pauses Caused by Background IO Traffic. engineering.linkedin.com, February 2016. Archived at perma.cc/ML2M-X9XT
[78] 庄振云和 Tran Cuong. 消除由后台 IO 流量引起的 JVM 大 GC 暂停. engineering.linkedin.com, 2016 年 2 月. 永久存档于 perma.cc / ML2M - X9XT

[79] Martin Thompson.Java Garbage Collection Distilled. mechanical-sympathy.blogspot.co.uk, July 2013. Archived at perma.cc/DJT3-NQLQ
[79] Martin Thompson. Java 垃圾回收精粹. mechanical-sympathy.blogspot.co.uk, 2013 年 7 月. 永久存档于 perma.cc/DJT3-NQLQ

[80] David Terei and Amit Levy.Blade: A Data Center Garbage Collector. arXiv.02578, April 2015.
[80] David Terei 和 Amit Levy. Blade: 一个数据中心垃圾回收器. arXiv.02578, 2015 年 4 月.

[81] Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz.Trash Day: Coordinating Garbage Collection in Distributed Systems. At 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.
[81] Martin Maas, Tim Harris, Krste Asanović和 John Kubiatowicz. 废物回收日: 在分布式系统中协调垃圾回收. 在第 15 届 USENIX 操作系统热点研讨会(HotOS), 2015 年 5 月.

[82] Martin Fowler.The LMAX Architecture.martinfowler.com, July 2011. Archived at perma.cc/5AV4-N6RJ
[82] Martin Fowler. The LMAX Architecture. martinfowler.com, 2011 年 7 月. 永久存档于 perma.cc / 5AV4 - N6RJ

[83] Joseph Y. Halpern and Yoram Moses.Knowledge and common knowledge in a distributed environment. Journal of the ACM (JACM), volume 37, issue 3, pages 549–587, July 1990.doi.1145/79147.79161
[83] Joseph Y. Halpern 和 Yoram Moses. 分布式环境中的知识与共同知识. ACM 汇刊 (JACM), 第 37 卷, 第 3 期, 第 549-587 页, 1990 年 7 月. doi.1145/79147.79161

[84] Chuzhe Tang, Zhaoguo Wang, Xiaodong Zhang, Qianmian Yu, Binyu Zang, Haibing Guan, and Haibo Chen.Ad Hoc Transactions in Web Applications: The Good, the Bad, and the Ugly. At ACM International Conference on Management of Data (SIGMOD), June 2022.doi.1145/3514221.3526120
[84] Chuzhe Tang, Zhaoguo Wang, Xiaodong Zhang, Qianmian Yu, Binyu Zang, Haibing Guan 和 Haibo Chen. Web 应用中的自举事务:好的、坏的与丑的. 在 ACM 国际数据管理会议 (SIGMOD), 2022 年 6 月. doi.1145/3514221.3526120

[85] Flavio P. Junqueira and Benjamin Reed.ZooKeeper: Distributed Process Coordination. O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[85] Flavio P. Junqueira 和 Benjamin Reed. ZooKeeper:分布式进程协调. O’Reilly Media, 2013 年. ISBN: 978-1-449-36130-3

[86] Enis Söztutar.HBase and HDFS: Understanding Filesystem Usage in HBase. At HBaseCon, June 2013. Archived at perma.cc/4DXR-9P88
[86] Enis Söztutar. HBase 和 HDFS:理解 HBase 中的文件系统使用情况。在 HBaseCon 会议上,2013 年 6 月。存档于 perma.cc / 4DXR - 9P88

[87] SUSE LLC.SUSE Linux Enterprise High Availability 15 SP6 Administration Guide, Section 12: Fencing and STONITH.documentation.suse.com, March 2025. Archived at perma.cc/8LAR-EL9D
[87] SUSE LLC. SUSE Linux Enterprise High Availability 15 SP6 管理指南,第 12 节:Fencing 和 STONITH。documentation.suse.com,2025 年 3 月。存档于 perma.cc / 8LAR - EL9D

[88] Mike Burrows.The Chubby Lock Service for Loosely-Coupled Distributed Systems. At 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
[88] Mike Burrows. 松散耦合分布式系统的 Chubby 锁服务。在 USENIX 第 7 届操作系统设计与实现研讨会(OSDI),2006 年 11 月。

[89] Kyle Kingsbury.etcd 3.4.3. jepsen.io, January 2020. Archived at perma.cc/2P3Y-MPWU

[90] Ensar Basri Kahveci.Distributed Locks are Dead; Long Live Distributed Locks!hazelcast.com, April 2019. Archived at perma.cc/7FS5-LDXE
[90] Ensar Basri Kahveci. 分布式锁已死;分布式锁万岁!hazelcast.com,2019 年 4 月。存档于 perma.cc / 7FS5 - LDXE

[91] Martin Kleppmann.How to do distributed locking. martin.kleppmann.com, February 2016. Archived at perma.cc/Y24W-YQ5L

[92] Salvatore Sanfilippo.Is Redlock safe?antirez.com, February 2016. Archived at perma.cc/B6GA-9Q6A

[93] Gunnar Morling.Leader Election With S3 Conditional Writes. www.morling.dev, August 2024. Archived at perma.cc/7V2N-J78Y

[94] Leslie Lamport, Robert Shostak, and Marshall Pease.The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems (TOPLAS), volume 4, issue 3, pages 382–401, July 1982.doi.1145/357172.357176
[94] 莱斯利・兰伯特、罗伯特・肖斯塔克和马歇尔・皮斯。拜占庭将军问题。ACM 计算机程序设计语言与系统杂志(TOPLAS),第 4 卷,第 3 期,第 382–401 页,1982 年 7 月。doi.1145/357172.357176

[95] Jim N. Gray.Notes on Data Base Operating Systems. in Operating Systems: An Advanced Course, Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7. Archived at perma.cc/7S9M-2LZU
[95] 吉姆・N・格雷。数据库操作系统笔记。在《操作系统:高级课程》,计算机科学讲义,第 60 卷,由 R. 巴伊尔、R. M. 格雷厄姆和 G. 塞格缪勒编辑,第 393–481 页,Springer-Verlag,1978 年。ISBN: 978-3-540-08755-7。存档于 perma.cc / 7S9M - 2LZU

[96] Brian Palmer.How Complicated Was the Byzantine Empire?slate.com, October 2011. Archived at perma.cc/AN7X-FL3N
[96] 布赖恩・帕尔默。拜占庭帝国有多复杂?slate.com,2011 年 10 月。存档于 perma.cc / AN7X - FL3N

[97] Leslie Lamport.My Writings.lamport.azurewebsites.net, December 2014. Archived at perma.cc/5NNM-SQGR
[97] 莱斯利・兰伯特。我的著作。lamport.azurewebsites.net,2014 年 12 月。存档于 perma.cc / 5NNM - SQGR

[98] John Rushby.Bus Architectures for Safety-Critical Embedded Systems. At 1st International Workshop on Embedded Software (EMSOFT), October 2001.doi.1007/3-540-45449-7_22
[98] John Rushby. 用于安全关键嵌入式系统的总线架构. 在首届嵌入式软件国际研讨会(EMSOFT)上,2001 年 10 月. doi.1007/3-540-45449-7_22

[99] Jake Edge.ELC: SpaceX Lessons Learned. lwn.net, March 2013. Archived at perma.cc/AYX8-QP5X
[99] Jake Edge. ELC: SpaceX 经验教训. lwn.net, 2013 年 3 月. 存档于 perma.cc/AYX8-QP5X

[100] Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis.SoK: Consensus in the Age of Blockchains. At 1st ACM Conference on Advances in Financial Technologies (AFT), October 2019.doi.1145/3318041.3355458
[100] Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, 和 George Danezis. SoK: 区块链时代的共识. 在首届 ACM 金融技术进步会议(AFT)上,2019 年 10 月. doi.1145/3318041.3355458

[101] Ezra Feilden, Adi Oltean, and Philip Johnston.Why we should train AI in space. White Paper, starcloud.com, September 2024. Archived at perma.cc/7Y3S-8UB6
[101] Ezra Feilden, Adi Oltean, 和 Philip Johnston. 我们为何应在太空中训练 AI. 白皮书, starcloud.com, 2024 年 9 月. 存档于 perma.cc / 7Y3S - 8UB6

[102] James Mickens.The Saddest Moment. USENIX;login, May 2013. Archived at perma.cc/T7BZ-XCFR
[102] James Mickens. 最悲伤的时刻. USENIX;login, 2013 年 5 月. 存档于 perma.cc / T7BZ - XCFR

[103] Martin Kleppmann and Heidi Howard.Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases. arxiv.org, December 2020.doi.48550/arXiv.2012.00472
[103] Martin Kleppmann 和 Heidi Howard. 拜占庭最终一致性与点对点数据库的基本限制.arxiv.org, 2020 年 12 月. doi.48550/arXiv.2012.00472

[104] Martin Kleppmann.Making CRDTs Byzantine Fault Tolerant. At 9th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC), April 2022.doi.1145/3517209.3524042
[104] Martin Kleppmann. 使 CRDTs 具有拜占庭容错性. 在第 9 届分布式数据一致性原理与实践研讨会 (PaPoC), 2022 年 4 月. doi.1145/3517209.3524042

[105] Evan Gilman.The Discovery of Apache ZooKeeper’s Poison Packet. pagerduty.com, May 2015. Archived at perma.cc/RV6L-Y5CQ
[105] Evan Gilman. Apache ZooKeeper 的毒包发现. pagerduty.com, 2015 年 5 月. 存档于 perma.cc / RV6L - Y5CQ

[106] Jonathan Stone and Craig Partridge.When the CRC and TCP Checksum Disagree. At ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), August 2000.doi.1145/347059.347561
[106] Jonathan Stone 和 Craig Partridge. 当 CRC 和 TCP 校验和发生冲突时. 在 ACM 计算机通信应用、技术、架构和协议会议 (SIGCOMM), 2000 年 8 月. doi.1145/347059.347561

[107] Evan Jones.How Both TCP and Ethernet Checksums Fail. evanjones.ca, October 2015. Archived at perma.cc/9T5V-B8X5
[107] Evan Jones. TCP 和以太网校验和如何都失败. evanjones.ca, 2015 年 10 月. 存档于 perma.cc / 9T5V - B8X5

[108] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer.Consensus in the Presence of Partial Synchrony. Journal of the ACM, volume 35, issue 2, pages 288–323, April 1988. doi.1145/42282.42283
[108] Cynthia Dwork, Nancy Lynch, 和 Larry Stockmeyer. 部分同步下的共识. ACM 杂志, 卷 35, 期 2, 页 288–323, 1988 年 4 月. doi.1145/42282.42283

[109] Richard D. Schlichting and Fred B. Schneider.Fail-stop processors: an approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems (TOCS), volume 1, issue 3, pages 222–238, August 1983.doi.1145/357369.357371
[109] Richard D. Schlichting 和 Fred B. Schneider. 故障停止处理器:设计容错计算系统的方法. ACM 计算机系统交易 (TOCS), 卷 1, 期 3, 页 222–238, 1983 年 8 月. doi.1145/357369.357371

[110] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi.Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems. At 4th ACM Symposium on Cloud Computing (SoCC), October 2013.doi.1145/2523616.2523627
[110] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, 和 Haryadi S. Gunawi. Limplock: 理解 Limpware 对可扩展云系统的影响. 在第 4 届 ACM 云计算研讨会 (SoCC), 2013 年 10 月. doi.1145/2523616.2523627

[111] Josh Snyder and Joseph Lynch.Garbage collecting unhealthy JVMs, a proactive approach. Netflix Technology Blog,netflixtechblog.medium.com, November 2019. Archived at perma.cc/8BTA-N3YB
[111] Josh Snyder 和 Joseph Lynch. 健康状况不佳的 JVM 垃圾回收,一种主动方法. Netflix 技术博客, netflixtechblog.medium.com, 2019 年 11 月. 保存在 perma.cc / 8BTA - N3YB

[112] Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li.Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems. At 16th USENIX Conference on File and Storage Technologies, February 2018.
[112] Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, 和 Huaicheng Li. Fail-Slow at Scale: 大型生产系统中硬件性能故障的证据. 在第 16 届 USENIX 文件和存储技术会议, 2018 年 2 月.

[113] Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao.Gray Failure: The Achilles’ Heel of Cloud-Scale Systems. At 16th Workshop on Hot Topics in Operating Systems (HotOS), May 2017.doi.1145/3102980.3103005
[113] 彭黄,郭传雄,周立冬,Jacob R. Lorch,党英农,Chintalapati Murali,和 Yao Randolph。灰失败:云规模系统的阿喀琉斯之踵。在第 16 届操作系统热点研讨会(HotOS),2017 年 5 月。doi.1145/3102980.3103005

[114] Chang Lou, Peng Huang, and Scott Smith.Understanding, Detecting and Localizing Partial Failures in Large System Software. At 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), February 2020.
[114] Lou Chang,彭黄,和 Scott Smith。理解、检测和定位大型系统软件的部分故障。在第 17 届 USENIX 网络系统设计与实现研讨会(NSDI),2020 年 2 月。

[115] Peter Bailis and Ali Ghodsi.Eventual Consistency Today: Limitations, Extensions, and Beyond. ACM Queue, volume 11, issue 3, pages 55-63, March 2013.doi.1145/2460276.2462076
[115] Peter Bailis 和 Ali Ghodsi。今天的最终一致性:局限性、扩展和超越。ACM Queue,第 11 卷,第 3 期,第 55-63 页,2013 年 3 月。doi.1145/2460276.2462076

[116] Bowen Alpern and Fred B. Schneider.Defining Liveness.Information Processing Letters, volume 21, issue 4, pages 181–185, October 1985.doi.1016/0020-0190(85)90056-0
[116] Bowen Alpern 和 Fred B. Schneider. 定义活性。Information Processing Letters, 卷 21, 第 4 期, 第 181–185 页, 1985 年 10 月。doi.1016/0020-0190(85)90056-0

[117] Flavio P. Junqueira.Dude, Where’s My Metadata?fpj.me, May 2015. Archived at perma.cc/D2EU-Y9S5
[117] Flavio P. Junqueira. 嘿,我的元数据在哪里?fpj.me, 2015 年 5 月。存档于 perma.cc / D2EU - Y9S5

[118] Scott Sanders.January 28th Incident Report. github.com, February 2016. Archived at perma.cc/5GZR-88TV
[118] Scott Sanders. 1 月 28 日事件报告。github.com, 2016 年 2 月。存档于 perma.cc / 5GZR - 88TV

[119] Jay Kreps.A Few Notes on Kafka and Jepsen. blog.empathybox.com, September 2013.perma.cc/XJ5C-F583
[119] Jay Kreps. 关于 Kafka 和 Jepsen 的几点笔记。blog.empathybox.com, 2013 年 9 月。perma.cc / XJ5C - F583

[120] Marc Brooker and Ankush Desai.Systems Correctness Practices at AWS.Queue, Volume 22, Issue 6, November/December 2024.doi.1145/3712057
[120] Marc Brooker 和 Ankush Desai. AWS 系统正确性实践. Queue, 第 22 卷, 第 6 期, 2024 年 11 月 / 12 月. doi.1145/3712057

[121] Andrey Satarin.Testing Distributed Systems: Curated list of resources on testing distributed systems. asatarin.github.io. Archived at perma.cc/U5V8-XP24
[121] Andrey Satarin. 测试分布式系统: 关于测试分布式系统的资源精选列表. asatarin.github.io. 归档于 perma.cc / U5V8 - XP24

[122] Jack Vanlightly.Verifying Kafka transactions - Diary entry 2 - Writing an initial TLA+ spec.jack-vanlightly.com, December 2024. Archived at perma.cc/NSQ8-MQ5N
[122] Jack Vanlightly. 验证 Kafka 事务 - 日志条目 2 - 编写初始 TLA+ 规范. jack-vanlightly.com, 2024 年 12 月. 存档于 perma.cc/NSQ8-MQ5N

[123] Siddon Tang.From Chaos to Order — Tools and Techniques for Testing TiDB, A Distributed NewSQL Database. pingcap.com, April 2018. Archived at perma.cc/5EJB-R29F
[123] Siddon Tang. 从混沌到秩序 — 测试 TiDB 分布式 NewSQL 数据库的工具与技术. pingcap.com, 2018 年 4 月. 归档于 perma.cc / 5EJB - R29F

[124] Nathan VanBenschoten.Parallel Commits: An atomic commit protocol for globally distributed transactions. cockroachlabs.com, November 2019. Archived at perma.cc/5FZ7-QK6J
[124] Nathan VanBenschoten. 并行提交: 一种用于全球分布式事务的原子提交协议. cockroachlabs.com, 2019 年 11 月. 归档于 perma.cc / 5FZ7 - QK6J

[125] Jack Vanlightly.Paper: VR Revisited - State Transfer (part 3).jack-vanlightly.com, December 2022. Archived at perma.cc/KNK3-K6WS
[125] Jack Vanlightly. 论文: VR 重访 - 状态转移(第 3 部分). jack-vanlightly.com, 2022 年 12 月. 归档于 perma.cc/KNK3-K6WS

[126] Hillel Wayne.What if the spec doesn’t match the code?buttondown.com, March 2024. Archived at perma.cc/8HEZ-KHER
[126] Hillel Wayne. 如果规范与代码不匹配会怎样?buttondown.com,2024 年 3 月。存档于 perma.cc / 8HEZ - KHER

[127] Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang, Madhav Jivrajani, Xiaoxing Ma, Tianyin Xu.Multi-Grained Specifications for Distributed System Model Checking and Verification. At 20th European Conference on Computer Systems (EuroSys), March 2025. doi.1145/3689031.3696069
[127] Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang, Madhav Jivrajani, Xiaoxing Ma, Tianyin Xu. 用于分布式系统模型检查和验证的多粒度规范。在第 20 届欧洲计算机系统会议(EuroSys)上,2025 年 3 月。doi.1145/3689031.3696069

[128] Yury Izrailevsky and Ariel Tseitlin.The Netflix Simian Army.netflixtechblog.com, July, 2011. Archived at perma.cc/M3NY-FJW6
[128] Yury Izrailevsky 和 Ariel Tseitlin. Netflix 的 Simian 军团。netflixtechblog.com,2011 年 7 月。存档于 perma.cc / M3NY - FJW6

[129] Kyle Kingsbury.Jepsen: On the perils of network partitions.aphyr.com, May, 2013. Archived at perma.cc/W98G-6HQP
[129] Kyle Kingsbury. Jepsen:关于网络分区之危害。aphyr.com,2013 年 5 月。存档于 perma.cc / W98G - 6HQP

[130] Kyle Kingsbury.Jepsen Analyses. jepsen.io, 2024. Archived at perma.cc/8LDN-D2T8
[130] Kyle Kingsbury. Jepsen Analyses. jepsen.io, 2024. 归档于 perma.cc / 8LDN - D2T8

[131] Rupak Majumdar and Filip Niksic.Why is random testing effective for partition tolerance bugs?Proceedings of the ACM on Programming Languages (PACMPL), volume 2, issue POPL, article no. 46, December 2017.doi.1145/3158134
[131] Rupak Majumdar 和 Filip Niksic. 为什么随机测试对分区容错性错误有效?ACM 编程语言会议论文集(PACMPL),第 2 卷,POPL 期,第 46 篇论文,2017 年 12 月。doi.1145/3158134

[132] FoundationDB project authors.Simulation and Testing.apple.github.io. Archived at perma.cc/NQ3L-PM4C
[132] FoundationDB 项目作者. 模拟与测试。apple.github.io. 归档于 perma.cc / NQ3L - PM4C

[133] Alex Kladov.Simulation Testing For Liveness. tigerbeetle.com, July 2023. Archived at perma.cc/RKD4-HGCR
[133] Alex Kladov. 活性模拟测试。tigerbeetle.com, 2023 年 7 月。归档于 perma.cc/RKD4-HGCR

[134] Alfonso Subiotto Marqués.(Mostly) Deterministic Simulation Testing in Go. polarsignals.com, May 2024. Archived at perma.cc/ULD6-TSA4
[134] Alfonso Subiotto Marqués. (主要) Go 语言中的确定性模拟测试。polarsignals.com,2024 年 5 月。存档于 perma.cc/ULD6-TSA4