ch7-sharding

20.7K
0
0
最后修改于

一个分布式数据库通常以两种方式分配数据:

  1. 在多个节点上保留相同数据的一个副本:这是复制
  2. 将大量数据拆分成更小的分片,并将不同的分片存储在不同的节点上

分片通常与复制结合使用,每个分片的副本都存储在多个节点上。

A node may store more than one shard. If a single-leader replication model is used, the combination of sharding and replication can look like Figure 7-1, for example. Each shard’s leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the leader for some shards and a follower for other shards, but each shard still only has one leader.
一个节点可能存储多个分片。如果使用单主节点复制模型,分片与复制的组合可能看起来像下图所示。每个分片的主节点被分配到一个节点,而其从节点被分配到其他节点。每个节点可能对某些分片是主节点,对其他分片是从节点,但每个分片仍然只有一个从节点。

分片的优缺点#

分片数据库的主要原因是可扩展性:当数据量或写入吞吐量过大,单个节点无法处理时,它是一种解决方案,分片使得数据和相关写入分散到多个节点上。

事实上,分片是我们实现水平扩展(即扩展架构)的主要工具之一:系统可以通过增加更多(更小)的机器来扩展其容量,而不是通过迁移到更强大的机器。如果你能够将工作负载划分得使每个分片处理大致相等的份额,那么你就可以将这些分片分配到不同的机器上,以便并行处理它们的数据和查询。

虽然复制在小型和大型规模上都有用,因为它能够实现容错和离线操作,但分片是一个重量级的解决方案,主要适用于大型规模。如果你的数据量和写入吞吐量可以在单台机器上处理(而如今单台机器已经能做很多事情了!),那么避免分片并坚持使用单分片数据库通常更好。

The reason for this recommendation is that sharding often adds complexity: you typically have to decide which records to put in which shard by choosing a partition key; all records with the same partition key are placed in the same shard [4]. This choice matters because accessing a record is fast if you know which shard it’s in, but if you don’t know the shard you have to do an inefficient search across all shards, and the sharding scheme is difficult to change.
这个建议的原因是分片通常会添加复杂性:你通常需要通过选择一个分区键来决定哪些记录放入哪个分片;具有相同分区键的所有记录都会被放在同一个分片中 [4]。这个选择很重要,因为如果你知道记录在哪一个分片中,访问记录就会很快,但如果你不知道分片,就必须在所有分片中进行低效的搜索,而且分片方案很难更改。

Thus, sharding often works well for key-value data, where you can easily shard by key, but it’s harder with relational data where you may want to search by a secondary index, or join records that may be distributed across different shards. We will discuss this further in “Sharding and Secondary Indexes”.
因此,分片通常适用于键值数据,你可以轻松地按键进行分片,但在关系型数据中,你可能需要通过次级索引进行搜索,或者连接可能分布在不同分片中的记录。我们将在 “分片和次级索引” 中进一步讨论这一点。

Another problem with sharding is that a write may need to update related records in several different shards. While transactions on a single node are quite common (see Chapter 8), ensuring consistency across multiple shards requires a distributed transaction. As we shall see in Chapter 8, distributed transactions are available in some databases, but they are usually much slower than single-node transactions, may become a bottleneck for the system as a whole, and some systems don’t support them at all.
分片的另一个问题是,一个写入操作可能需要更新多个不同分片中相关的记录。虽然单节点上的事务非常常见(见第 8 章),但确保跨多个分片的 consistency 需要分布式事务。正如我们在第 8 章中将要看到的,某些数据库提供了分布式事务,但它们通常比单节点事务慢得多,可能会成为整个系统的瓶颈,而且有些系统根本不支持分布式事务。

Some systems use sharding even on a single machine, typically running one single-threaded process per CPU core to make use of the parallelism in the CPU, or to take advantage of a nonuniform memory access (NUMA) architecture in which some banks of memory are closer to one CPU than to others [5]. For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to spread load across CPU cores in the same machine [6].
一些系统即使在单台机器上也使用分片,通常每个 CPU 核心运行一个单线程进程,以利用 CPU 的并行性,或利用非均匀内存访问(NUMA)架构,其中一些内存组离某个 CPU 比离其他 CPU 更近 [5]。例如,Redis、VoltDB 和 FoundationDB 每个核心运行一个进程,并依赖分片来将负载分散到同一台机器的 CPU 核心上 [6]。

Sharding for Multitenancy 面向多租户的分片#

Software as a Service (SaaS) products and cloud services are often multitenant, where each tenant is a customer. Multiple users may have logins on the same tenant, but each tenant has a self-contained dataset that is separate from other tenants. For example, in an email marketing service, each business that signs up is typically a separate tenant, since one business’s newsletter signups, delivery data etc. are separate from those of other businesses.
软件即服务(SaaS)产品及云服务通常是多租户的,每个租户就是一个客户。多个用户可能属于同一租户,但每个租户都有一个独立的数据集,与其他租户的数据是分开的。例如,在电子邮件营销服务中,每个注册的企业通常都是一个独立的租户,因为一家企业的新订阅者信息、发送数据等都是与其他企业分开的。

Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate shard, or multiple small tenants may be grouped together into a larger shard. These shards might be physically separate databases (which we previously touched on in “Embedded storage engines”), or separately manageable portions of a larger logical database [7]. Using sharding for multitenancy has several advantages:
有时会使用分片来实现多租户系统:每个租户分配一个独立的分片,或者多个小型租户可以组合在一起形成一个较大的分片。这些分片可能是物理上独立的数据库(我们在 “嵌入式存储引擎” 中曾提及),或者是更大逻辑数据库中可单独管理的部分 [7]。使用分片实现多租户具有以下几个优势:

Resource isolation 资源隔离

If one tenant performs a computationally expensive operation, it is less likely that other tenants’ performance will be affected if they are running on different shards.
如果一个租户执行计算密集型操作,如果它们运行在不同的分片上,其他租户的性能受影响的可能性较小。

Permission isolation 权限隔离

If there is a bug in your access control logic, it’s less likely that you will accidentally give one tenant access to another tenant’s data if those tenants’ datasets are stored physically separately from each other.
如果你的访问控制逻辑存在漏洞,当不同租户的数据集在物理上分开存储时,你不太可能意外地让一个租户访问另一个租户的数据。

Cell-based architecture 基于单元格的架构

You can apply sharding not only at the data storage level, but also for the services running your application code. In a cell-based architecture, the services and storage for a particular set of tenants are grouped into a self-contained cell, and different cells are set up such that they can run largely independently from each other. This approach provides fault isolation: that is, a fault in one cell remains limited to that cell, and tenants in other cells are not affected [8].
你可以不仅在数据存储层面应用分片,还可以用于运行应用程序代码的服务。在基于单元格的架构中,特定一组租户的服务和存储被组合成一个自包含的单元格,不同的单元格被设置成可以基本上独立运行。这种方法提供了故障隔离:也就是说,一个单元格中的故障仅限于该单元格,而其他单元格中的租户不会受到影响 [8]。

Per-tenant backup and restore
租户级备份与恢复

Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a backup without affecting other tenants, which can be useful in case the tenant accidentally deletes or overwrites important data [9].
分别备份每个租户的分区,使得可以从备份中恢复租户的状态,而不会影响其他租户,这在租户意外删除或覆盖重要数据的情况下非常有用 [9]。

Regulatory compliance 合规性要求

Data privacy regulation such as the GDPR gives individuals the right to access and delete all data stored about them. If each person’s data is stored in a separate shard, this translates into simple data export and deletion operations on their shard [10].
数据隐私法规如 GDPR 赋予个人访问和删除所有关于他们的存储数据的权利。如果每个人的数据都存储在不同的分区内,这就转化为对其分区进行简单的数据导出和删除操作 [10]。

Data residence 数据驻留

If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a particular region.
如果某个特定租户的数据需要根据数据驻留法律存储在特定司法管辖区,区域感知数据库可以允许您将该租户的分区分配到特定区域。

Gradual schema rollout 渐进式模式发布

Schema migrations (previously discussed in “Schema flexibility in the document model”) can be rolled out gradually, one tenant at a time. This reduces risk, as you can detect problems before they affect all tenants, but it can be difficult to do transactionally [11].
模式迁移(在 “文档模型中的模式灵活性” 中讨论过)可以逐步推出,一次一个租户。这降低了风险,因为您可以在它们影响所有租户之前检测到问题,但它可能难以进行事务处理 [11]。

The main challenges around using sharding for multitenancy are:
围绕使用分片进行多租户的主要挑战是:

  • It assumes that each individual tenant is small enough to fit on a single node. If that is not the case, and you have a single tenant that’s too big for one machine, you would need to additionally perform sharding within a single tenant, which brings us back to the topic of sharding for scalability [12].
    它假设每个单个租户都足够小,可以放在一个节点上。如果情况不是这样,并且您有一个单个租户太大而无法容纳在一台机器上,您需要在一个租户内进行额外的分片,这又带回了关于分片以实现可扩展性的主题 [12]。
  • If you have many small tenants, then creating a separate shard for each one may incur too much overhead. You could group several small tenants together into a bigger shard, but then you have the problem of how you move tenants from one shard to another as they grow.
    如果您有多个小租户,那么为每个租户创建一个单独的分片可能会产生过多的开销。您可以把几个小租户组合在一起到一个更大的分片中,但这样您又面临一个问题,即当租户增长时,如何将它们从一个分片移动到另一个分片。
  • If you ever need to support features that connect data across multiple tenants, these become harder to implement if you need to join data across multiple shards.
    如果你需要支持连接跨多个租户的数据的功能,当需要连接跨多个分片的数据时,这些功能就变得更难实现。

Sharding of Key-Value Data 键值数据分片#

Say you have a large amount of data, and you want to shard it. How do you decide which records to store on which nodes?
假设你有一大量数据,并想对其进行分片。你该如何决定哪些记录存储在哪些节点上?

Our goal with sharding is to spread the data and the query load evenly across nodes. If every node takes a fair share, then—in theory—10 nodes should be able to handle 10 times as much data and 10 times the read and write throughput of a single node (ignoring replication). Moreover, if we add or remove a node, we want to be able to rebalance the load so that it is evenly distributed across the 11 (when adding) or the remaining 9 (when removing) nodes.
分片的目标是将数据和查询负载均匀地分配到各个节点。如果每个节点都公平地分担负载,那么理论上 —— 在不考虑复制的情况下 ——10 个节点应该能够处理比单个节点多 10 倍的数据和读写吞吐量。此外,当我们添加或移除节点时,我们希望能够重新平衡负载,使其均匀地分布在 11 个节点(添加时)或剩余的 9 个节点(移除时)。

If the sharding is unfair, so that some shards have more data or queries than others, we call it skewed. The presence of skew makes sharding much less effective. In an extreme case, all the load could end up on one shard, so 9 out of 10 nodes are idle and your bottleneck is the single busy node. A shard with disproportionately high load is called a hot shard or hot spot. If there’s one key with a particularly high load (e.g., a celebrity in a social network), we call it a hot key.
如果分片不公平,导致某些分片比其他分片拥有更多的数据或查询,我们称之为倾斜。倾斜的存在会大大降低分片的有效性。在极端情况下,所有负载都可能集中在一个分片上,因此 10 个节点中有 9 个处于空闲状态,而瓶颈是那个唯一繁忙的节点。负载过高的分片被称为热点分片或热点。如果某个键的负载特别高(例如,社交网络中的名人),我们称之为热键。

Therefore we need an algorithm that takes as input the partition key of a record, and tells us which shard that record is in. In a key-value store the partition key is usually the key, or the first part of the key. In a relational model the partition key might be some column of a table (not necessarily its primary key). That algorithm needs to be amenable to rebalancing in order to relieve hot spots.
因此我们需要一个算法,它以记录的分区键作为输入,并告诉我们该记录位于哪个分片。在键值存储中,分区键通常是键本身,或者是键的第一部分。在关系模型中,分区键可能是表中的一列(不一定是主键)。该算法需要能够支持重新平衡,以缓解热点问题。

Sharding by Key Range 按键范围分片#

One way of sharding is to assign a contiguous range of partition keys (from some minimum to some maximum) to each shard, like the volumes of a paper encyclopedia, as illustrated in Figure 7-2. In this example, an entry’s partition key is its title. If you want to look up the entry for a particular title, you can easily determine which shard contains that entry by finding the volume whose key range contains the title you’re looking for, and thus pick the correct book off the shelf.
一种分片方式是为每个分片分配一个连续的分区键范围(从某个最小值到某个最大值),就像纸质百科全书那样,如图 7-2 所示。在这个例子中,条目的分区键是其标题。如果你要查找特定标题的条目,你可以通过找到包含你要查找标题的键范围的卷,从而轻松确定哪个分片包含该条目,然后从书架上取出正确的书籍。

ddia 0702

The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed. For example, in Figure 7-2, volume 1 contains words starting with A and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z. Simply having one volume per two letters of the alphabet would lead to some volumes being much bigger than others. In order to distribute the data evenly, the shard boundaries need to adapt to the data.
键的范围不一定均匀分布,因为你的数据可能不是均匀分布的。例如,在图 7-2 中,卷 1 包含以 A 和 B 开头的单词,但卷 12 包含以 T、U、V、W、X、Y 和 Z 开头的单词。仅仅每两个字母分配一个卷会导致某些卷远大于其他卷。为了均匀分布数据,分片边界需要适应数据。

The shard boundaries might be chosen manually by an administrator, or the database can choose them automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for example; the automatic variant is used by Bigtable, its open source equivalent HBase, the range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB [6]. YugabyteDB offers both manual and automatic tablet splitting.
分片边界可能由管理员手动选择,或者由数据库自动选择。例如,Vitess(一个用于 MySQL 的分片层)使用手动键范围分片;自动变体则被 Bigtable、其开源等价物 HBase、MongoDB、CockroachDB、RethinkDB 和 FoundationDB 中的基于范围的分片选项使用 [6]。YugabyteDB 提供手动和自动的表格分割。

Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in Chapter 4). This has the advantage that range scans are easy, and you can treat the key as a concatenated index in order to fetch several related records in one query (see “Multidimensional and Full-Text Indexes”). For example, consider an application that stores data from a network of sensors, where the key is the timestamp of the measurement. Range scans are very useful in this case, because they let you easily fetch, say, all the readings from a particular month.
在每个分片中,键以排序方式存储(例如,在 B 树或 SSTables 中,如第 4 章所述)。这有一个优点,即范围扫描很方便,并且你可以将键视为一个连接索引,以便在一个查询中获取多个相关记录(参见 “多维和全文索引”)。例如,考虑一个存储来自传感器网络数据的应用,其中键是测量时间戳。在这种情况下,范围扫描非常有用,因为它们让你可以轻松获取某个特定月份的所有读数。

A downside of key range sharding is that you can easily get a hot shard if there are a lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the database as the measurements happen, all the writes end up going to the same shard (the one for this month), so that shard can be overloaded with writes while others sit idle [13].
键范围分片的一个缺点是,如果附近的关键字有大量写入,很容易出现热点分片。例如,如果关键字是时间戳,那么分片对应于时间范围 —— 例如,每个月一个分片。不幸的是,如果你在测量发生时将传感器数据写入数据库,所有写入最终都会发送到同一个分片(即当前月份的分片),因此该分片可能会因写入过多而超载,而其他分片则处于空闲状态 [13]。

To avoid this problem in the sensor database, you need to use something other than the timestamp as the first element of the key. For example, you could prefix each timestamp with the sensor ID so that the key ordering is first by sensor ID and then by timestamp. Assuming you have many sensors active at the same time, the write load will end up more evenly spread across the shards. The downside is that when you want to fetch the values of multiple sensors within a time range, you now need to perform a separate range query for each sensor.
为了避免传感器数据库中这个问题,你需要将时间戳以外的其他内容作为键的第一个元素。例如,你可以给每个时间戳前加上传感器 ID,这样键的排序会首先按传感器 ID,然后按时间戳。假设你同时有很多传感器在运行,写入负载最终会更均匀地分布在各个分片中。缺点是当你需要获取多个传感器在某个时间范围内的值时,现在需要为每个传感器执行单独的范围查询。

Rebalancing key-range sharded data 重新平衡键范围分片数据#

When you first set up your database, there are no key ranges to split into shards. Some databases, such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database, which is called pre-splitting. This requires that you already have some idea of what the key distribution is going to look like, so that you can choose appropriate key range boundaries [14].
当你首次设置数据库时,没有键范围可以拆分成分片。某些数据库,如 HBase 和 MongoDB,允许你在空数据库上配置初始的一组分片,这称为预拆分。这要求你事先对键的分布有一些了解,以便选择合适的键范围边界 [14]。

Later on, as your data volume and write throughput grow, a system with key-range sharding grows by splitting an existing shard into two or more smaller shards, each of which holds a contiguous sub-range of the original shard’s key range. The resulting smaller shards can then be distributed across multiple nodes. If large amounts of data are deleted, you may also need to merge several adjacent shards that have become small into one bigger one.This process is similar to what happens at the top level of a B-tree (see “B-Trees”).
后来,随着数据量和写入吞吐量的增长,键范围分片系统通过将一个现有分片拆分成两个或更多个较小的分片来扩展,每个较小的分片包含原始分片键范围的一个连续子范围。然后可以将这些较小的分片分布到多个节点上。如果删除了大量数据,你可能还需要将几个相邻的、已经变得较小的分片合并成一个更大的分片。这个过程类似于 B 树顶层发生的情况(见 “B 树”)。

With databases that manage shard boundaries automatically, a shard split is typically triggered by:
对于自动管理分片边界的数据库,分片拆分通常由以下情况触发:

  • the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
    分片达到配置的大小(例如,在 HBase 中,默认为 10 GB),或
  • in some systems, the write throughput being persistently above some threshold. Thus, a hot shard may be split even if it is not storing a lot of data, so that its write load can be distributed more uniformly.
    在某些系统中,写入吞吐量持续高于某个阈值。因此,即使一个分片存储的数据不多,也可能被拆分,以便其写入负载可以更均匀地分布。

An advantage of key-range sharding is that the number of shards adapts to the data volume. If there is only a small amount of data, a small number of shards is sufficient, so overheads are small; if there is a huge amount of data, the size of each individual shard is limited to a configurable maximum [15].
键范围分片的一个优点是分片数量会根据数据量进行调整。如果数据量较小,少量的分片就足够,因此开销较小;如果数据量巨大,每个分片的大小将限制在可配置的最大值 [15]。

A downside of this approach is that splitting a shard is an expensive operation, since it requires all of its data to be rewritten into new files, similarly to a compaction in a log-structured storage engine. A shard that needs splitting is often also one that is under high load, and the cost of splitting can exacerbate that load, risking it becoming overloaded.
这种方法的缺点是拆分分片是一个昂贵的操作,因为它需要将所有数据重写为新的文件,类似于日志结构化存储引擎中的压缩操作。需要拆分的分片通常是负载较高的分片,拆分的成本可能会加剧这种负载,导致其过载。

Sharding by Hash of Key 基于键的哈希分片#

Key-range sharding is useful if you want records with nearby (but different) partition keys to be grouped into the same shard; for example, this might be the case with timestamps. If you don’t care whether partition keys are near each other (e.g., if they are tenant IDs in a multitenant application), a common approach is to first hash the partition key before mapping it to a shard.
键范围分片在你希望具有相近(但不同)分区键的记录被分组到同一个分片中时很有用;例如,这可能适用于时间戳。如果你不在乎分区键是否彼此靠近(例如,如果它们是多租户应用中的租户 ID),一个常见的方法是先对分区键进行哈希处理,然后再将其映射到分片。

A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit hash function that takes a string. Whenever you give it a new string, it returns a seemingly random number between 0 and 2 32 − 1. Even if the input strings are very similar, their hashes are evenly distributed across that range of numbers (but the same input always produces the same output).
一个好的哈希函数可以将倾斜数据均匀分布。假设你有一个 32 位的哈希函数,它接受一个字符串。每次你给它一个新的字符串时,它都会返回一个看似随机的 0 到 2^31-1 之间的数字。即使输入字符串非常相似,它们的哈希值也会均匀地分布在这个数字范围内(但相同的输入始终会产生相同的输出)。

For sharding purposes, the hash function need not be cryptographically strong: for example, MongoDB uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash functions built in (as they are used for hash tables), but they may not be suitable for sharding: for example, in Java’s Object.hashCode() and Ruby’s Object#hash, the same key may have a different hash value in different processes, making them unsuitable for sharding [16].
为了分片的目的,哈希函数不必是密码学安全的:例如,MongoDB 使用 MD5,而 Cassandra 和 ScyllaDB 使用 Murmur3。许多编程语言内置了简单的哈希函数(因为它们用于哈希表),但这些函数可能不适合用于分片:例如,在 Java 的 Object.hashCode() 和 Ruby 的 Object#hash 中,相同的键在不同进程中可能有不同的哈希值,因此不适合用于分片 [16]。

Hash modulo number of nodes 哈希取模节点数量#

Once you have hashed the key, how do you choose which shard to store it in? Maybe your first thought is to take the hash value modulo the number of nodes in the system (using the % operator in many programming languages). For example, hash (key) % 10 would return a number between 0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit). If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node.
一旦你对键进行了哈希处理,该如何选择将其存储在哪个分片中?或许你的第一个想法是取哈希值的模系统中的节点数(在许多编程语言中使用 % 运算符)。例如,hash (key) % 10 会返回一个介于 0 到 9 之间的数字(如果我们把哈希值写成十进制数,hash % 10 将是最后一位)。如果我们有 10 个编号为 0 到 9 的节点,这似乎是一种简单的方法来将每个键分配给节点。

The problem with the mod N approach is that if the number of nodes N changes, most of the keys have to be moved from one node to another. Figure 7-3 shows what happens when you have three nodes and add a fourth. Before the rebalancing, node 0 stored the keys whose hashes are 0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on.
模 N 方法的缺点是,如果节点数 N 发生变化,大多数键都需要从一个节点移动到另一个节点。图 7-3 展示了当你有三个节点并添加第四个节点时会发生什么。在重新平衡之前,节点 0 存储了哈希值为 0、3、6、9 等的键。在添加第四个节点后,哈希值为 3 的键移动到了节点 3,哈希值为 6 的键移动到了节点 2,哈希值为 9 的键移动到了节点 1,等等。

ddia 0703

The mod N function is easy to compute, but it leads to very inefficient rebalancing because there is a lot of unnecessary movement of records from one node to another. We need an approach that doesn’t move data around more than necessary.
取模 N 函数易于计算,但它会导致非常低效的再平衡,因为记录会在节点之间进行大量不必要的移动。我们需要一种不会比必要时更多地移动数据的方法。

Fixed number of shards 固定数量的分片#

One simple but widely-used solution is to create many more shards than there are nodes, and to assign several shards to each node. For example, a database running on a cluster of 10 nodes may be split into 1,000 shards from the outset so that 100 shards are assigned to each node. A key is then stored in shard number hash (key) % 1,000, and the system separately keeps track of which shard is stored on which node.
一种简单但广泛使用的解决方案是创建比节点更多的分片,并将多个分片分配给每个节点。例如,一个在 10 节点集群上运行的数据库从一开始可能被分成 1000 个分片,这样每个节点分配 100 个分片。然后,将键存储在编号为 hash (key) % 1000 的分片中,系统会分别跟踪哪个分片存储在哪个节点上。

Now, if a node is added to the cluster, the system can reassign some of the shards from existing nodes to the new node until they are fairly distributed once again. This process is illustrated in Figure 7-4. If a node is removed from the cluster, the same happens in reverse.
现在,如果向集群添加节点,系统可以将一些现有节点的分片重新分配给新节点,直到它们再次公平分布。该过程如图 7-4 所示。如果从集群中移除节点,则发生相反的过程。

ddia 0704

In this model, only entire shards are moved between nodes, which is cheaper than splitting shards. The number of shards does not change, nor does the assignment of keys to shards. The only thing that changes is the assignment of shards to nodes. This change of assignment is not immediate—it takes some time to transfer a large amount of data over the network—so the old assignment of shards is used for any reads and writes that happen while the transfer is in progress.
在这种模型中,节点之间移动的是整个分片,这比拆分分片更便宜。分片数量不会改变,键到分片的分配也不会改变。唯一变化的是分片到节点的分配。这种分配变化不是立即发生的 —— 在网络中传输大量数据需要一些时间 —— 所以在传输进行期间,任何读写操作都会使用旧的分片分配。

It’s common to choose the number of shards to be a number that is divisible by many factors, so that the dataset can be evenly split across various different numbers of nodes—not requiring the number of nodes to be a power of 2, for example [4]. You can even account for mismatched hardware in your cluster: by assigning more shards to nodes that are more powerful, you can make those nodes take a greater share of the load.
通常选择分片数量为一个能被许多因素整除的数字,以便数据集可以均匀地分配到不同数量的节点上 —— 例如,不必要求节点数量是 2 的幂 。甚至可以考虑到集群中的硬件不匹配:通过给更强大的节点分配更多分片,可以让这些节点承担更大的负载。

This approach to sharding is used in Citus (a sharding layer for PostgreSQL), Riak, Elasticsearch, and Couchbase, among others. It works well as long as you have a good estimate of how many shards you will need when you first create the database. You can then add or remove nodes easily, subject to the limitation that you can’t have more nodes than you have shards.
这种分片方法被 Citus(PostgreSQL 的分片层)、Riak、Elasticsearch 和 Couchbase 等系统采用。只要你在创建数据库时能对所需分片数量有一个良好的估计,这种方法就非常有效。之后你可以轻松地添加或删除节点,但受限于节点数量不能超过分片数量的限制。

If you find the originally configured number of shards to be wrong—for example, if you have reached a scale where you need more nodes than you have shards—then an expensive resharding operation is required. It needs to split each shard and write it out to new files, using a lot of additional disk space in the process. Some systems don’t allow resharding while concurrently writing to the database, which makes it difficult to change the number of shards without downtime.
如果你发现最初配置的分片数量不正确 —— 例如,如果你已经达到需要比分片数量更多的节点的扩展规模 —— 那么就需要进行昂贵的重分片操作。这需要将每个分片分割并写入到新的文件中,过程中会消耗大量的额外磁盘空间。有些系统不允许在同时向数据库写入时进行重分片,这使得在不造成停机的情况下更改分片数量变得困难。

Choosing the right number of shards is difficult if the total size of the dataset is highly variable (for example, if it starts small but may grow much larger over time). Since each shard contains a fixed fraction of the total data, the size of each shard grows proportionally to the total amount of data in the cluster. If shards are very large, rebalancing and recovery from node failures become expensive. But if shards are too small, they incur too much overhead. The best performance is achieved when the size of shards is “just right,” neither too big nor too small, which can be hard to achieve if the number of shards is fixed but the dataset size varies.
如果数据集的总大小变化很大(例如,如果它开始时很小,但可能会随着时间的推移增长得更大),那么选择正确的分片数量会很困难。由于每个分片包含总数据的一定比例,每个分片的大小会随着集群中总数据量的增长而按比例增长。如果分片非常大,那么重新平衡和从节点故障中恢复会变得昂贵。但如果分片太小,它们会产生过多的开销。当分片的大小 “刚刚好”,既不太大也不太小的时候,性能最佳,但如果分片数量固定而数据集大小变化,要实现这一点可能很困难。

Sharding by hash range 按哈希范围分片#

If the required number of shards can’t be predicted in advance, it’s better to use a scheme in which the number of shards can adapt easily to the workload. The aforementioned key-range sharding scheme has this property, but it has a risk of hot spots when there are a lot of writes to nearby keys. One solution is to combine key-range sharding with a hash function so that each shard contains a range of hash values rather than a range of keys.
如果无法提前预测所需分片数量,最好使用一种能够轻松适应工作负载的分片方案。上述的按键范围分片方案具有这种特性,但在有大量写入邻近键的情况下存在热点风险。一种解决方案是将按键范围分片与哈希函数结合,使每个分片包含一系列哈希值而非一系列键。

Figure 7-5 shows an example using a 16-bit hash function that returns a number between 0 and 65,535 = 2 16 − 1 (in reality, the hash is usually 32 bits or more). Even if the input keys are very similar (e.g., consecutive timestamps), their hashes are uniformly distributed across that range. We can then assign a range of hash values to each shard: for example, values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on.
图 7-5 展示了一个使用 16 位哈希函数的示例,该函数返回 0 到 65,535(即 2^16 - 1)之间的数值(实际上,哈希值通常是 32 位或更多)。即使输入键非常相似(例如,连续的时间戳),它们的哈希值也会均匀分布在该范围内。然后我们可以为每个分片分配一系列哈希值:例如,0 到 16,383 之间的值分配给分片 0,16,384 到 32,767 之间的值分配给分片 1,以此类推。

ddia 0705

Like with key-range sharding, a shard in hash-range sharding can be split when it becomes too big or too heavily loaded. This is still an expensive operation, but it can happen as needed, so the number of shards adapts to the volume of data rather than being fixed in advance.
和键范围分片一样,当哈希范围分片中的分片变得过大或负载过重时,可以进行拆分。这仍然是一项昂贵的操作,但可以根据需要随时进行,因此分片数量会根据数据量进行调整,而不是预先固定。

The downside compared to key-range sharding is that range queries over the partition key are not efficient, as keys in the range are now scattered across all the shards. However, if keys consist of two or more columns, and the partition key is only the first of these columns, you can still perform efficient range queries over the second and later columns: as long as all records in the range query have the same partition key, they will be in the same shard.
与键范围分片相比的缺点是,对分区键的范围查询效率不高,因为范围内的键现在分散在所有分片中。然而,如果键由两列或多列组成,并且分区键只是这些列中的第一列,那么你仍然可以对第二列及以后的列执行高效的范围查询:只要范围内查询的所有记录都具有相同的分区键,它们就会位于同一个分片中。

Hash-range sharding is used in YugabyteDB and DynamoDB [17], and is an option in MongoDB. Cassandra and ScyllaDB use a variant of this approach that is illustrated in Figure 7-6: the space of hash values is split into a number of ranges proportional to the number of nodes (3 ranges per node in Figure 7-6, but actual numbers are 8 per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between those ranges. This means some ranges are bigger than others, but by having multiple ranges per node those imbalances tend to even out [15,18].
哈希范围分片用于 YugabyteDB 和 DynamoDB [17],也是 MongoDB 的一个选项。Cassandra 和 ScyllaDB 使用这种方法的变体,如图 7-6 所示:哈希值的空间被分成与节点数量成比例的多个范围(图 7-6 中每个节点 3 个范围,但 Cassandra 默认每个节点 8 个范围,ScyllaDB 每个节点 256 个范围),这些范围之间有随机边界。这意味着有些范围比其他范围大,但通过每个节点有多个范围,这些不平衡往往会得到平衡 [15, 18]。

ddia 0706

When nodes are added or removed, range boundaries are added and removed, and shards are split or merged accordingly [19]. In the example of Figure 7-6, when node 3 is added, node 1 transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to node 3. This has the effect of giving the new node an approximately fair share of the dataset, without transferring more data than necessary from one node to another.
当节点被添加或移除时,范围边界会被添加和移除,相应地分片会被拆分或合并 [19]。在图 7-6 的示例中,当添加节点 3 时,节点 1 将其两个范围的部分数据转移给节点 3,节点 2 将其一个范围的部分数据转移给节点 3。这的效果是使新节点获得数据集的大致公平份额,同时避免从一个节点向另一个节点转移过多数据。

Consistent hashing 一致性哈希#

A consistent hashing algorithm is a hash function that maps keys to a specified number of shards in a way that satisfies two properties:
一致性哈希算法是一种哈希函数,它将键映射到指定数量的分片中,并满足两个属性:

  1. the number of keys mapped to each shard is roughly equal, and
    每个分片中映射的键的数量大致相等,
  2. when the number of shards changes, as few keys as possible are moved from one shard to another.
    当分片数量发生变化时,应尽可能少地将键从一个分片移动到另一个分片。

Note that consistent here has nothing to do with replica consistency (see Chapter 6) or ACID consistency (see Chapter 8), but rather describes the tendency of a key to stay in the same shard as much as possible.
需要注意的是,这里的 "一致性" 与副本一致性(见第 6 章)或 ACID 一致性(见第 8 章)无关,而是描述一个键尽可能保持在同一个分片中。

The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing [20], but several other consistent hashing algorithms have also been proposed [21], such as highest random weight, also known as rendezvous hashing [22], and jump consistent hash [23]. With Cassandra’s algorithm, if one node is added, a small number of existing shards are split into sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned individual keys that were previously scattered across all of the other nodes. Which one is preferable depends on the application.
Cassandra 和 ScyllaDB 使用的分片算法与一致性哈希的原始定义 [20] 相似,但也有人提出了其他一致性哈希算法 [21],例如最高随机权重,也称为会晤哈希 [22],以及跳跃一致性哈希 [23]。使用 Cassandra 的算法时,如果添加一个节点,少量现有的分片会被拆分成子范围;而使用会晤哈希和跳跃一致性哈希时,新节点会被分配到之前分散在所有其他节点中的独立键。哪种更优取决于应用。

Skewed Workloads and Relieving Hot Spots 倾斜负载和缓解热点#

Consistent hashing ensures that keys are uniformly distributed across nodes, but that doesn’t mean that the actual load is uniformly distributed. If the workload is highly skewed—that is, the amount of data under some partition keys is much greater than other keys, or if the rate of requests to some keys is much higher than to others—you can still end up with some servers being overloaded while others sit almost idle.
一致性哈希确保键在节点间均匀分布,但这并不意味着实际负载均匀分布。如果工作负载高度倾斜 —— 也就是说,某些分区键下的数据量远大于其他键,或者对某些键的请求率远高于对其他键的请求率 —— 你仍然可能某些服务器过载,而其他服务器几乎闲置。

For example, on a social media site, a celebrity user with millions of followers may cause a storm of activity when they do something [24]. This event can result in a large volume of reads and writes to the same key (where the partition key is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
例如,在社交媒体网站上,一个拥有数百万粉丝的名人用户在他们做某事时可能会引发一场活动 [24]。这一事件可能导致对同一个键的大量读写(其中分区键可能是名人的用户 ID,或者人们评论的动作的 ID)。

In such situations, a more flexible sharding policy is required [25,26]. A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put an individual hot key in a shard by its own, and perhaps even assigning it a dedicated machine [27].
在这种情况下,需要更灵活的分区策略 [25, 26]。一个基于键范围(或哈希范围)定义分区的系统,可以让单个热点键独立地放入一个分区,甚至可能为其分配一台专用机器 [27]。

It’s also possible to compensate for skew at the application level. For example, if one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key. Just a two-digit decimal random number would split the writes to the key evenly across 100 different keys, allowing those keys to be distributed to different shards.
也可以在应用层面补偿倾斜问题。例如,如果某个键已知非常热门,一个简单的技术是在键的开头或结尾添加一个随机数。只需添加一个两位数的十进制随机数,就能将对该键的写入均匀分配到 100 个不同的键上,从而让这些键被分配到不同的分片。

However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it. The volume of reads to each shard of the hot key is not reduced; only the write load is split. This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you also need some way of keeping track of which keys are being split, and a process for converting a regular key into a specially-managed hot key.
然而,虽然将写入分散到不同的键上,但任何读取操作现在都必须做额外的工作,因为它们需要从所有 100 个键中读取数据并将其合并。热门键的每个分片上的读取量并未减少;只有写入负载被分散了。这种技术还需要额外的记录工作:对于少数热门键添加随机数才有意义;对于绝大多数写入吞吐量低的键来说,这将是不必要的额外开销。因此,你还需要一种方法来跟踪哪些键正在被分散,以及一个将普通键转换为特殊管理的热门键的流程。

The problem is further compounded by change of load over time: for example, a particular social media post that has gone viral may experience high load for a couple of days, but thereafter it’s likely to calm down again. Moreover, some keys may be hot for writes while others are hot for reads, necessitating different strategies for handling them.
这个问题还因负载随时间的变化而更加复杂:例如,一条突然爆红的社交媒体帖子可能会在几天内承受高负载,但之后可能会再次趋于平静。此外,某些键可能写操作频繁(热点写),而另一些键可能读操作频繁(热点读),这就需要不同的策略来处理它们。

Some systems (especially cloud services designed for large scale) have automated approaches for dealing with hot shards; for example, Amazon calls it heat management [28] or adaptive capacity [17]. The details of how these systems work go beyond the scope of this book.
一些系统(尤其是为大规模设计的云服务)有自动处理热分片的方法;例如,亚马逊称之为热管理 [28] 或自适应容量 [17]。这些系统的工作原理的细节超出了本书的范围。

Operations: Automatic or Manual Rebalancing 操作:自动或手动再平衡#

There is one important question with regard to rebalancing that we have glossed over: does the splitting of shards and rebalancing happen automatically or manually?
关于重新平衡有一个重要的问题我们一直未提及:分片和重新平衡是自动进行还是手动进行?

Some systems automatically decide when to split shards and when to move them from one node to another, without any human interaction, while others leave sharding to be explicitly configured by an administrator. There is also a middle ground: for example, Couchbase and Riak generate a suggested shard assignment automatically, but require an administrator to commit it before it takes effect.
一些系统会自动决定何时分片以及何时将分片从一个节点移动到另一个节点,无需人工干预,而另一些系统则将分片工作留给管理员显式配置。也存在一种折中的做法:例如,Couchbase 和 Riak 会自动生成分片分配建议,但需要管理员提交后才会生效。

Fully automated rebalancing can be convenient, because there is less operational work to do for normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud databases such as DynamoDB are promoted as being able to automatically add and remove shards to adapt to big increases or decreases of load within a matter of minutes [17,29].
完全自动的重新平衡可以很方便,因为日常维护所需的操作工作量较少,而且这类系统甚至可以自动扩展以适应工作负载的变化。像 DynamoDB 这样的云数据库被宣传能够自动添加和删除分片,以在几分钟内适应负载的显著增加或减少 [17, 29]。

However, automatic shard management can also be unpredictable. Rebalancing is an expensive operation, because it requires rerouting requests and moving a large amount of data from one node to another. If it is not done carefully, this process can overload the network or the nodes, and it might harm the performance of other requests. The system must continue processing writes while the rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting process might not even be able to keep up with the rate of incoming writes [29].
然而,自动分片管理也可能不可预测。重新平衡是一项昂贵的操作,因为它需要重新路由请求并将大量数据从一个节点移动到另一个节点。如果处理不当,这个过程可能会使网络或节点过载,并可能损害其他请求的性能。系统必须在重新平衡过程中继续处理写入操作;如果系统接近其最大写入吞吐量,分片拆分过程甚至可能无法跟上传入写入的速率 [29]。

Such automation can be dangerous in combination with automatic failure detection. For example, say one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This puts additional load on other nodes and the network, making the situation worse. There is a risk of causing a cascading failure where other nodes become overloaded and are also falsely suspected of being down.
这种自动化与自动故障检测结合使用可能很危险。例如,假设一个节点过载,暂时响应请求缓慢。其他节点得出结论,认为过载的节点已死,并自动重新平衡集群以将负载移离它。这会给其他节点和网络带来额外的负载,使情况变得更糟。存在引发级联故障的风险,其中其他节点过载,并且也被错误地怀疑宕机。

For that reason, it can be a good thing to have a human in the loop for rebalancing. It’s slower than a fully automatic process, but it can help prevent operational surprises.
因此,在重新平衡时让人类参与进来可能是个好主意。这比完全自动化的过程要慢,但它可以帮助防止运营中的意外情况。

Request Routing 请求路由#

We have discussed how to shard a dataset across multiple nodes, and how to rebalance those shards as nodes are added or removed. Now let’s move on to the question: if you want to read or write a particular key, how do you know which node—i.e., which IP address and port number—you need to connect to?
我们已经讨论了如何将数据集分片到多个节点,以及如何在节点添加或删除时重新平衡这些分片。现在让我们来探讨一个问题:如果你想要读取或写入特定的键,你如何知道需要连接到哪个节点 —— 即哪个 IP 地址和端口号?

We call this problem request routing, and it’s very similar to service discovery, which we previously discussed in “Load balancers, service discovery, and service meshes”. The biggest difference between the two is that with services running application code, each instance is usually stateless, and a load balancer can send a request to any of the instances. With sharded databases, a request for a key can only be handled by a node that is a replica for the shard containing that key.
我们将这个问题称为请求路由,它与我们在 “负载均衡器、服务发现和服务网格” 中之前讨论的服务发现非常相似。这两者之间的最大区别在于,由于服务运行应用程序代码,每个实例通常是状态 less 的,负载均衡器可以将请求发送到任何实例。而对于分片数据库,对键的请求只能由包含该键的分片的副本节点处理。

This means that request routing has to be aware of the assignment from keys to shards, and from shards to nodes. On a high level, there are a few different approaches to this problem (illustrated in Figure 7-7):
这意味着请求路由必须了解从键到分片的分配,以及从分片到节点的分配。从高层次来看,解决这个问题有几种不同的方法(如图 7-7 所示):

  1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node coincidentally owns the shard to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node, receives the reply, and passes the reply along to the client.
    允许客户端联系任意节点(例如,通过轮询负载均衡器)。如果该节点碰巧拥有与请求相关的分片,它可以直接处理该请求;否则,它将请求转发给适当的节点,接收回复,并将回复传递给客户端。
  2. Send all requests from clients to a routing tier first, which determines the node that should handle each request and forwards it accordingly. This routing tier does not itself handle any requests; it only acts as a shard-aware load balancer.
    将所有来自客户端的请求首先发送到路由层,该层确定每个请求应处理的节点,并相应地转发。这个路由层本身不处理任何请求;它仅充当一个感知分片的负载均衡器。
  3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this case, a client can connect directly to the appropriate node, without any intermediary.
    要求客户端了解分片以及分片分配到节点的规则。在这种情况下,客户端可以直接连接到适当的节点,无需任何中介。
    ddia 0707

In all cases, there are some key problems:
在所有情况下,存在一些关键问题:

  • Who decides which shard should live on which node? It’s simplest to have a single coordinator making that decision, but in that case how do you make it fault-tolerant in case the node running the coordinator goes down? And if the coordinator role can failover to another node, how do you prevent a split-brain situation (see “Handling Node Outages”) where two different coordinators make contradictory shard assignments?
    谁决定哪个分片应该存在于哪个节点上?最简单的方法是让单个协调器做出这个决定,但在这种情况下,如果运行协调器的节点宕机,如何使其具有容错性?如果协调器角色可以切换到另一个节点,如何防止出现脑裂情况(参见 “处理节点故障”),即两个不同的协调器做出相互矛盾的分片分配?
  • How does the component performing the routing (which may be one of the nodes, or the routing tier, or the client) learn about changes in the assignment of shards to nodes?
    执行路由的组件(可能是节点之一、路由层或客户端)如何学习到分片分配到节点的变化?
  • While a shard is being moved from one node to another, there is a cutover period during which the new node has taken over, but requests to the old node may still be in flight. How do you handle those?
    当分片从一个节点移动到另一个节点时,存在一个切换期,新节点已经接管,但旧节点的请求可能仍在进行中。你如何处理这些情况?

Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to keep track of shard assignments, as illustrated in Figure 7-8. They use consensus algorithms (see Chapter 10) to provide fault tolerance and protection against split-brain. Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of shards to nodes. Other actors, such as the routing tier or the sharding-aware client, can subscribe to this information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier so that it can keep its routing information up to date.
许多分布式数据系统依赖于单独的协调服务(如 ZooKeeper 或 etcd)来跟踪分片分配,如图 7-8 所示。它们使用共识算法(见第 10 章)来提供容错性和防止脑裂。每个节点在 ZooKeeper 中注册自己,ZooKeeper 维护分片到节点的权威映射。其他参与者,如路由层或感知分片的客户端,可以从 ZooKeeper 中订阅这些信息。每当分片变更所有权,或节点被添加或移除时,ZooKeeper 会通知路由层,以便它能够更新其路由信息。

ddia 0708

For example, HBase and SolrCloud use ZooKeeper to manage shard assignment, and Kubernetes uses etcd to keep track of which service instance is running where. MongoDB has a similar architecture, but it relies on its own config server implementation and mongos daemons as the routing tier. Kafka, YugabyteDB, and TiDB use built-in implementations of the Raft consensus protocol to perform this coordination function.
例如,HBase 和 SolrCloud 使用 ZooKeeper 来管理分片分配,而 Kubernetes 使用 etcd 来跟踪哪个服务实例在何处运行。MongoDB 具有类似的架构,但它依赖于自己的配置服务器实现和 mongos 守护进程作为路由层。Kafka、YugabyteDB 和 TiDB 使用 Raft 共识协议的内置实现来执行此协调功能。

Cassandra, ScyllaDB, and Riak take a different approach: they use a gossip protocol among the nodes to disseminate any changes in cluster state. This provides much weaker consistency than a consensus protocol; it is possible to have split brain, in which different parts of the cluster have different node assignments for the same shard. Leaderless databases can tolerate this because they generally make weak consistency guarantees anyway (see “Limitations of Quorum Consistency”).
Cassandra、ScyllaDB 和 Riak 采取了不同的方法:它们在节点之间使用广播协议来传播集群状态中的任何更改。这与共识协议相比提供了非常弱的容错性;可能存在脑分裂的情况,即集群的不同部分对同一分片具有不同的节点分配。无主数据库可以容忍这种情况,因为它们通常无论如何都提供弱容错保证(参见 “容错一致性的一致性限制”)。

When using a routing tier or when sending requests to a random node, clients still need to find the IP addresses to connect to. These are not as fast-changing as the assignment of shards to nodes, so it is often sufficient to use DNS for this purpose.
在使用路由层或向随机节点发送请求时,客户端仍然需要找到要连接的 IP 地址。这些地址不像分片分配给节点的分配那样频繁变化,因此通常使用 DNS 就足够了。

This discussion of request routing has focused on finding the shard for an individual key, which is most relevant for sharded OLTP databases. Analytic databases often use sharding as well, but they typically have a very different kind of query execution: rather than executing in a single shard, a query typically needs to aggregate and join data from many different shards in parallel. We will discuss techniques for such parallel query execution in Chapter 11.
关于请求路由的讨论主要集中于为单个键找到对应的分片,这对于分片化的 OLTP 数据库最为相关。分析型数据库也常常使用分片,但它们的查询执行方式通常非常不同:查询通常需要在多个不同的分片中并行地聚合和连接数据,而不是在单个分片中执行。我们将在第 11 章讨论这种并行查询执行的技术。

Sharding and Secondary Indexes 分片与辅助索引#

The sharding schemes we have discussed so far rely on the client knowing the partition key for any record it wants to access. This is most easily done in a key-value data model, where the partition key is the first part of the primary key (or the entire primary key), and so we can use the partition key to determine the shard, and thus route reads and writes to the node that is responsible for that key.
我们之前讨论的分片方案依赖于客户端知道它想要访问的任何记录的分区键。这在键值数据模型中最容易实现,其中分区键是主键的第一部分(或整个主键),因此我们可以使用分区键来确定分片,并据此将读写请求路由到负责该键的节点。

The situation becomes more complicated if secondary indexes are involved (see also “Multi-Column and Secondary Indexes”). A secondary index usually doesn’t identify a record uniquely but rather is a way of searching for occurrences of a particular value: find all actions by user 123, find all articles containing the word hogwash, find all cars whose color is red, and so on.
如果涉及二级索引(另见 “多列和二级索引”),情况会变得更加复杂。二级索引通常不能唯一标识一条记录,而是一种查找特定值出现方式的方法:查找用户 123 的所有操作,查找包含单词 hogwash 的所有文章,查找颜色为 red 的所有汽车等等。

Key-value stores often don’t have secondary indexes, but they are the bread and butter of relational databases, they are common in document databases too, and they are the raison d’être of full-text search engines such as Solr and Elasticsearch. The problem with secondary indexes is that they don’t map neatly to shards. There are two main approaches to sharding a database with secondary indexes: local and global indexes.
键值存储通常没有二级索引,但它们是关系型数据库的核心,在文档数据库中也常见,并且是 Solr 和 Elasticsearch 等全文搜索引擎存在的根本原因。二级索引的问题在于它们不能很好地映射到分片上。在带二级索引的数据库分片主要有两种方法:本地索引和全局索引。

Local Secondary Indexes 本地二级索引#

For example, imagine you are operating a website for selling used cars (illustrated in Figure 7-9). Each listing has a unique ID, and you use that ID as partition key for sharding (for example, IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1, etc.).
例如,假设你正在运营一个二手汽车销售网站(如图 7-9 所示)。每个列表都有一个唯一 ID,你使用该 ID 作为分片的分区键(例如,ID 0 到 499 在分片 0,ID 500 到 999 在分片 1 等等)。

If you want to let users search for cars, allowing them to filter by color and by make, you need a secondary index on color and make (in a document database these would be fields; in a relational database they would be columns). If you have declared the index, the database can perform the indexing automatically. For example, whenever a red car is added to the database, the database shard automatically adds its ID to the list of IDs for the index entry color:red. As discussed in Chapter 4, that list of IDs is also called a postings list.
如果你想让用户搜索汽车,并允许他们按颜色和品牌进行筛选,你需要在 colormake 上创建一个辅助索引(在文档数据库中这些会是字段;在关系型数据库中它们会是列)。如果你已经声明了索引,数据库可以自动执行索引操作。例如,每当数据库中添加一辆红色汽车时,数据库分片会自动将其 ID 添加到索引条目 color:red 的 ID 列表中。正如第四章所述,这个 ID 列表也被称为倒排列表。

ddia 0709

Warning 警告#

If your database only supports a key-value model, you might be tempted to implement a secondary index yourself by creating a mapping from values to IDs in application code. If you go down this route, you need to take great care to ensure your indexes remain consistent with the underlying data. Race conditions and intermittent write failures (where some changes were saved but others weren’t) can very easily cause the data to go out of sync—see “The need for multi-object transactions”.
如果你的数据库仅支持键值模型,你可能会想通过在应用代码中创建从值到 ID 的映射来自己实现一个二级索引。如果你选择这条路,你需要非常小心地确保你的索引与底层数据保持一致。竞争条件和间歇性写入失败(有些更改被保存了,而有些没有)很容易导致数据不同步 —— 参见 “多对象事务的需求”。

In this indexing approach, each shard is completely separate: each shard maintains its own secondary indexes, covering only the records in that shard. It doesn’t care what data is stored in other shards. Whenever you write to the database—to add, remove, or update a records—you only need to deal with the shard that contains the record that you are writing. For that reason, this type of secondary index is known as a local index. In an information retrieval context it is also known as a document-partitioned index [30].
在这种索引方法中,每个分片是完全独立的:每个分片维护自己的二级索引,仅覆盖该分片中的记录。它不在乎其他分片中存储了什么数据。每次你写入数据库时 —— 无论是添加、删除还是更新记录 —— 你只需要处理包含你要写入记录的那个分片。因此,这种类型的二级索引被称为本地索引。在信息检索的上下文中,它也被称为文档分区索引 [30]。

When reading from a local secondary index, if you already know the partition key of the record you’re looking for, you can just perform the search on the appropriate shard. Moreover, if you only want some results, and you don’t need all, you can send the request to any shard.
当从本地二级索引读取时,如果你已经知道你要查找的记录的分区键,你只需在相应的分片上执行搜索。此外,如果你只需要部分结果,而不需要全部,你可以将请求发送到任何分片。

However, if you want all the results and don’t know their partition key in advance, you need to send the query to all shards, and combine the results you get back, because the matching records might be scattered across all the shards. In Figure 7-9, red cars appear in both shard 0 and shard 1.
然而,如果你需要所有结果并且事先不知道它们的分区键,你需要将查询发送到所有分片,并组合你得到的结果,因为匹配的记录可能分布在所有分片中。在图 7-9 中,红色汽车出现在分片 0 和分片 1 中。

This approach to querying a sharded database can make read queries on secondary indexes quite expensive. Even if you query the shards in parallel, it is prone to tail latency amplification (see “Use of Response Time Metrics”). It also limits the scalability of your application: adding more shards lets you store more data, but it doesn’t increase your query throughput if every shard has to process every query anyway.
这种查询分片数据库的方法可以使二级索引上的读取查询非常昂贵。即使你并行查询分片,也容易导致尾部延迟放大(参见 “响应时间指标的使用”)。它还限制了你的应用程序的可扩展性:添加更多分片可以让你存储更多数据,但如果每个分片都必须处理每个查询,它不会增加你的查询吞吐量。

Nevertheless, local secondary indexes are widely used [31]: for example, MongoDB, Riak, Cassandra [32], Elasticsearch [33], SolrCloud, and VoltDB [34] all use local secondary indexes.
尽管如此,本地二级索引被广泛使用 [31]:例如,MongoDB、Riak、Cassandra [32]、Elasticsearch [33]、SolrCloud 和 VoltDB [34] 都使用本地二级索引。

Global Secondary Indexes 全局辅助索引#

Rather than each shard having its own, local secondary index, we can construct a global index that covers data in all shards. However, we can’t just store that index on one node, since it would likely become a bottleneck and defeat the purpose of sharding. A global index must also be sharded, but it can be sharded differently from the primary key index.
与其让每个分片拥有自己的本地辅助索引,我们可以构建一个覆盖所有分片数据的全局索引。然而,我们不能将这个索引存储在一个节点上,因为它很可能成为瓶颈,从而破坏分片的初衷。全局索引也必须被分片,但它可以与主键索引以不同的方式被分片。

Figure 7-10 illustrates what this could look like: the IDs of red cars from all shards appear under color:red in the index, but the index is sharded so that colors starting with the letters a to r appear in shard 0 and colors starting with s to z appear in shard 1. The index on the make of car is partitioned similarly (with the shard boundary being between f and h).
图 7-10 说明了这可能是什么样子:所有分片中红色汽车的车牌号都出现在索引的 color:red 下,但索引被分片,使得以字母 a 到 r 开头的颜色出现在分片 0,以字母 s 到 z 开头的颜色出现在分片 1。汽车品牌的索引也以类似的方式被分区(分片边界在 f 和 h 之间)。

ddia 0710

This kind of index is also called term-partitioned [30]: recall from “Full-Text Search” that in full-text search, a term is a keyword in a text that you can search for. Here we generalise it to mean any value that you can search for in the secondary index.
这种索引也称为术语分区 [30]:回顾 “全文搜索” 部分,全文搜索中,术语是文本中可以搜索的关键字。在这里,我们将其泛化,表示任何可以在二级索引中搜索的值。

The global index uses the term as partition key, so that when you’re looking for a particular term or value, you can figure out which shard you need to query. As before, a shard can contain a contiguous range of terms (as in Figure 7-10), or you can assign terms to shards based on a hash of the term.
全局索引使用术语作为分区键,这样当你查找特定术语或值时,就能确定需要查询哪个分片。如前所述,一个分片可以包含连续范围的术语(如图 7-10 所示),或者你可以根据术语的哈希值将术语分配给分片。

Global indexes have the advantage that a query with a single condition (such as color = red) only needs to read from a single shard to fetch the postings list. However, if you want to fetch records and not just IDs, you still have to read from all the shards that are responsible for those IDs.
全局索引的优势在于,带有单一条件的查询(如 color = red)只需要从单个分片读取即可获取倒排索引列表。然而,如果你想要获取记录而不仅仅是 ID,你仍然需要从负责这些 ID 的所有分片读取。

If you have multiple search conditions or terms (e.g., searching for cars of a certain color and a certain make, or searching for multiple words occurring in the same text), it’s likely that those terms will be assigned to different shards. To compute the logical AND of the two conditions, the system needs to find all the IDs that occur in both of the postings lists. That’s no problem if the postings lists are short, but if they are long, it can be slow to send them over the network to compute their intersection [30].
如果你有多个搜索条件或术语(例如,搜索特定颜色和特定型号的汽车,或搜索同一文本中出现的多个词语),这些术语很可能被分配到不同的分片中。为了计算两个条件的逻辑 AND,系统需要找到在两个倒排索引列表中都出现的所有 ID。如果倒排索引列表较短,这不成问题,但如果列表较长,通过网络发送它们来计算它们的交集可能会很慢 [30]。

Another challenge with global secondary indexes is that writes are more complicated than with local indexes, because writing a single record might affect multiple shards of the index (every term in the document might be on a different shard). This makes it harder to keep the secondary index in sync with the underlying data. One option is to use a distributed transaction to atomically update the shards storing the primary record and its secondary indexes (see Chapter 8).
全局二级索引的另一个挑战是写入操作比本地索引更复杂,因为单条记录的写入可能会影响索引的多个分片(文档中的每个词可能位于不同的分片中)。这使得保持二级索引与底层数据同步变得更加困难。一个解决方案是使用分布式事务来原子性地更新存储主记录及其二级索引的分片(参见第 8 章)。

Global secondary indexes are used by CockroachDB, TiDB, and YugabyteDB; DynamoDB supports both local and global secondary indexes. In the case of DynamoDB, writes are asynchronously reflected in global indexes, so reads from a global index may be stale (similarly to replication lag, as in “Problems with Replication Lag”). Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if the postings lists are not too long.
全局二级索引被 CockroachDB、TiDB 和 YugabyteDB 使用;DynamoDB 支持本地和全局二级索引。在 DynamoDB 的情况下,写入操作异步反映在全局索引中,因此从全局索引读取的数据可能已过时(类似于 “复制延迟问题” 中的复制延迟)。尽管如此,如果读取吞吐量高于写入吞吐量,并且发布列表不会太长,全局索引仍然很有用。

Summary 摘要#

In this chapter we explored different ways of sharding a large dataset into smaller subsets. Sharding is necessary when you have so much data that storing and processing it on a single machine is no longer feasible.
在本章中,我们探讨了将大型数据集划分为较小子集的不同方法。当数据量如此之大,以至于在单台机器上存储和处理不再可行时,就需要进行分片。

The goal of sharding is to spread the data and query load evenly across multiple machines, avoiding hot spots (nodes with disproportionately high load). This requires choosing a sharding scheme that is appropriate to your data, and rebalancing the shards when nodes are added to or removed from the cluster.
分片的目标是将数据和查询负载均匀地分布在多台机器上,避免热点(负载不均的节点)。这需要选择适合您数据的分片方案,并在节点添加到或从集群中移除时重新平衡分片。

We discussed two main approaches to sharding:
我们讨论了两种主要的分片方法:

  • Key range sharding, where keys are sorted, and a shard owns all the keys from some minimum up to some maximum. Sorting has the advantage that efficient range queries are possible, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order.
    范围分片(键有序),一个分片拥有从某个最小值到某个最大值的所有键,比如 将 a-z,切分为 a-fg-uv-z。有序键可以高效的范围查询,但如果应用程序经常访问排序顺序中相邻的键,则存在热点风险,比如大部分键集中在 g-u 的情况。
    In this approach, shards are typically rebalanced by splitting the range into two subranges when a shard gets too big.
    在这种方法中,当分片变得过大时,通常会通过将范围拆分为两个子范围来重新平衡分片。
  • Hash sharding, where a hash function is applied to each key, and a shard owns a range of hash values (or another consistent hashing algorithm may be used to map hashes to shards). This method destroys the ordering of keys, making range queries inefficient, but it may distribute load more evenly.
    哈希分片,其中对每个键进行哈希,这种键是无序的,范围查询效率低下,但它负载可能更均衡。
    When sharding by hash, it is common to create a fixed number of shards in advance, to assign several shards to each node, and to move entire shards from one node to another when nodes are added or removed. Splitting shards, like with key ranges, is also possible.
    在哈希分片时,通常预先创建固定数量的分片,将多个分片分配给每个节点,并在节点添加或删除时将整个分片从一个节点移动到另一个节点。与键范围类似,拆分分片也是可能的。

It is common to use the first part of the key as the partition key (i.e., to identify the shard), and to sort records within that shard by the rest of the key. That way you can still have efficient range queries among the records with the same partition key.
通常将键的第一部分用作分区键(即用于识别分片),并在该分片中按键的其余部分对记录进行排序。这样,你仍然可以在具有相同分区键的记录之间进行高效的范围查询。

We also discussed the interaction between sharding and secondary indexes. A secondary index also needs to be sharded, and there are two methods:
我们还讨论了分片与二级索引之间的交互。二级索引也需要分片,有两种方法:

  • Local secondary indexes, where the secondary indexes are stored in the same shard as the primary key and value. This means that only a single shard needs to be updated on write, but a lookup of the secondary index requires reading from all shards.
    本地二级索引,其中二级索引与主键和值存储在同一个分片中。这意味着在写入时只需要更新单个分片,但查找二级索引时需要从所有分片读取。
  • Global secondary indexes, which are sharded separately based on the indexed values. An entry in the secondary index may refer to records from all shards of the primary key. When a record is written, several secondary index shards may need to be updated; however, a read of the postings list can be served from a single shard (fetching the actual records still requires reading from multiple shards).
    全局二级索引根据索引值分别在不同的分片中进行分片。二级索引中的一个条目可能引用主键的所有分片中的记录。当写入一条记录时,可能需要更新多个二级索引分片;然而,读取发布列表可以从单个分片中提供(获取实际记录仍然需要从多个分片中读取)。

Finally, we discussed techniques for routing queries to the appropriate shard, and how a coordination service is often used to keep track of the assigment of shards to nodes.
最后,我们讨论了将查询路由到适当分片的技术,以及协调服务通常如何用于跟踪分片到节点的分配。

By design, every shard operates mostly independently—that’s what allows a sharded database to scale to multiple machines. However, operations that need to write to several shards can be problematic: for example, what happens if the write to one shard succeeds, but another fails? We will address that question in the following chapters.
按设计,每个分片基本上独立运行 —— 这正是允许分片数据库扩展到多台机器的原因。然而,需要写入多个分片的操作可能会出现问题:例如,如果写入一个分片成功,但另一个失败怎么办?我们将在接下来的章节中探讨这个问题。

Footnotes 脚注#
References 参考文献#

[1] Claire Giordano.Understanding partitioning and sharding in Postgres and Citus. citusdata.com, August 2023. Archived at perma.cc/8BTK-8959
[1] Claire Giordano. 理解 Postgres 和 Citus 中的分区和分片。citusdata.com, 2023 年 8 月。存档于 perma.cc / 8BTK - 8959

[2] Brandur Leach.Partitioning in Postgres, 2022 edition. brandur.org, October 2022. Archived at perma.cc/Z5LE-6AKX
[2] Brandur Leach. 《Postgres 中的分区》,2022 年版。brandur.org,2022 年 10 月。存档于 perma.cc / Z5LE - 6AKX

[3] Raph Koster.Database “sharding” came from UO?raphkoster.com, January 2009. Archived at perma.cc/4N9U-5KYF
[3] Raph Koster. 数据库 “分片” 源自《UO》?raphkoster.com,2009 年 1 月。存档于 perma.cc / 4N9U - 5KYF

[4] Garrett Fidalgo.Herding elephants: Lessons learned from sharding Postgres at Notion. notion.com, October 2021. Archived at perma.cc/5J5V-W2VX
[4] Garrett Fidalgo. 牵引大象:在 Notion 分片 Postgres 的教训。notion.com,2021 年 10 月。存档于 perma.cc / 5J5V - W2VX

[5] Ulrich Drepper.What Every Programmer Should Know About Memory.akkadia.org, November 2007. Archived at perma.cc/NU6Q-DRXZ
[5] Ulrich Drepper. 每个程序员都应该知道的关于内存。akkadia.org,2007 年 11 月。存档于 perma.cc / NU6Q - DRXZ

[6] Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav.FoundationDB: A Distributed Unbundled Transactional Key Value Store. At ACM International Conference on Management of Data (SIGMOD), June 2021.doi.1145/3448016.3457559
[6] 周景宇,徐萌,Alexander Shraer,Bala Namasivayam,Alex Miller,Evan Tschannen,Steve Atherton,Andrew J. Beamon,Rusty Sears,John Leach,Dave Rosenthal,董欣,威尔逊,本・柯林斯,David Scherer,Alec Grieser,刘勇,Alvin Moore,Bhaskar Muppana,苏晓鸽,Vishesh Yadav. FoundationDB:一个分布式解耦事务键值存储。在 ACM 国际数据管理会议(SIGMOD),2021 年 6 月。doi.1145/3448016.3457559

[7] Marco Slot.Citus 12: Schema-based sharding for PostgreSQL. citusdata.com, July 2023. Archived at perma.cc/R874-EC9W
[7] Marco Slot. Citus 12:基于模式的 PostgreSQL 分片。citusdata.com,2023 年 7 月。保存在 perma.cc/R874-EC9W

[8] Robisson Oliveira.Reducing the Scope of Impact with Cell-Based Architecture. AWS Well-Architected white paper, Amazon Web Services, September 2023. Archived at perma.cc/4KWW-47NR
[8] Robisson Oliveira. 基于单元架构减少影响范围。AWS Well-Architected 白皮书,亚马逊网络服务,2023 年 9 月。保存在 perma.cc / 4KWW - 47NR

[9] Gwen Shapira.Things DBs Don’t Do - But Should.thenile.dev, February 2023. Archived at perma.cc/C3J4-JSFW
[9] Gwen Shapira. 数据库不做什么 —— 但应该做。thenile.dev,2023 年 2 月。保存在 perma.cc / C3J4 - JSFW

[10] Malte Schwarzkopf, Eddie Kohler, M. Frans Kaashoek, and Robert Morris.Position: GDPR Compliance by Construction. At Towards Polystores that manage multiple Databases, Privacy, Security and/or Policy Issues for Heterogenous Data (Poly), August 2019.doi.1007/978-3-030-33752-0_3
[10] Malte Schwarzkopf, Eddie Kohler, M. Frans Kaashoek 和 Robert Morris. 主题:通过构建实现 GDPR 合规性。发表于《面向管理多数据库、隐私、安全及 / 或政策问题的异构数据的多商店系统》(Poly),2019 年 8 月。doi.1007/978-3-030-33752-0_3

[11] Gwen Shapira.Introducing pg_karnak: Transactional schema migration across tenant databases. thenile.dev, November 2024. Archived at perma.cc/R5RD-8HR9
[11] Gwen Shapira. 介绍 pg_karnak:跨租户数据库的事务性模式迁移。thenile.dev,2024 年 11 月。存档于 perma.cc / R5RD - 8HR9

[12] Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón.Scaling Datastores at Slack with Vitess. slack.engineering, December 2020. Archived at perma.cc/UW8F-ALJK
[12] 阿尔卡・冈古里、吉多・亚昆蒂、梅吉・周和拉斐尔・查孔. 在 Slack 中用 Vitess 扩展数据存储. slack.engineering, 2020 年 12 月. 存档于 perma.cc / UW8F - ALJK

[13] Ikai Lan.App Engine Datastore Tip: Monotonically Increasing Values Are Bad. ikaisays.com, January 2011. Archived at perma.cc/BPX8-RPJB

[14] Enis Soztutar.Apache HBase Region Splitting and Merging. cloudera.com, February 2013. Archived at perma.cc/S9HS-2X2C

[15] Eric Evans.Rethinking Topology in Cassandra. At Cassandra Summit, June 2013. Archived at perma.cc/2DKM-F438

[16] Martin Kleppmann.Java’s hashCode Is Not Safe for Distributed Systems. martin.kleppmann.com, June 2012. Archived at perma.cc/LK5U-VZSN

[17] Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig.Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service. At USENIX Annual Technical Conference (ATC), July 2022.
[17] Mostafa Elhemali、Niall Gallagher、Nicholas Gordon、Joseph Idziorek、Richard Krog、Colin Lazier、Erben Mo、Akhilesh Mritunjai、Somu Perianayagam、Tim Rath、Swami Sivasubramanian、James Christopher Sorenson III、Sroaj Sosothikul、Doug Terry 和 Akshat Vig. Amazon DynamoDB:一个可扩展、性能可预测且完全托管的 NoSQL 数据库服务。在 USENIX 年度技术会议(ATC)上,2022 年 7 月。

[18] Brandon Williams.Virtual Nodes in Cassandra 1.2. datastax.com, December 2012. Archived at perma.cc/N385-EQXV
[18] Brandon Williams. Cassandra 1.2 中的虚拟节点。datastax.com,2012 年 12 月。存档于 perma.cc/N385-EQXV。

[19] Branimir Lambov.New Token Allocation Algorithm in Cassandra 3.0. datastax.com, January 2016. Archived at perma.cc/2BG7-LDWY
[19] Branimir Lambov. Cassandra 3.0 中的新令牌分配算法。datastax.com,2016 年 1 月。存档于 perma.cc / 2BG7 - LDWY。

[20] David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin.Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. At 29th Annual ACM Symposium on Theory of Computing (STOC), May 1997.doi.1145/258533.258660
[20] David Karger、Eric Lehman、Tom Leighton、Rina Panigrahy、Matthew Levine 和 Daniel Lewin. 一致哈希和随机树:用于缓解万维网热点问题的分布式缓存协议。在 ACM 计算理论第 29 届年会上,1997 年 5 月。doi.1145/258533.258660

[21] Damian Gryski.Consistent Hashing: Algorithmic Tradeoffs. dgryski.medium.com, April 2018. Archived at perma.cc/B2WF-TYQ8
[21] Damian Gryski. 一致性哈希:算法权衡。dgryski.medium.com, 2018 年 4 月。存档于 perma.cc / B2WF - TYQ8

[22] David G. Thaler and Chinya V. Ravishankar.Using name-based mappings to increase hit rates. IEEE/ACM Transactions on Networking, volume 6, issue 1, pages 1–14, February 1998.doi.1109/90.663936
[22] David G. Thaler 和 Chinya V. Ravishankar. 使用基于名称的映射来提高命中率。IEEE / ACM 计算机网络汇刊,第 6 卷,第 1 期,第 1-14 页,1998 年 2 月。doi.1109/90.663936

[23] John Lamping and Eric Veach.A Fast, Minimal Memory, Consistent Hash Algorithm. arxiv.org, June 2014.
[23] John Lamping 和 Eric Veach. 一种快速、最小内存的一致性哈希算法。arxiv.org, 2014 年 6 月。

[24] Samuel Axon.3% of Twitter’s Servers Dedicated to Justin Bieber. mashable.com, September 2010. Archived at perma.cc/F35N-CGVX
[24] Samuel Axon. 推特 3% 的服务器用于贾斯汀・比伯。mashable.com, 2010 年 9 月。存档于 perma.cc / F35N - CGVX

[25] Gerald Guo and Thawan Kooburat.Scaling services with Shard Manager. engineering.fb.com, August 2020. Archived at perma.cc/EFS3-XQYT
[25] Gerald Guo 和 Thawan Kooburat. 使用 Shard Manager 扩展服务. engineering.fb.com, 2020 年 8 月. 归档于 perma.cc/EFS3-XQYT

[26] Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, Biren Damani, Pol Mauri Ruiz, Vikas Mehta, and Chunqiang Tang.Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications. 28th ACM SIGOPS Symposium on Operating Systems Principles (SOSP), pages 553–569, October 2021.doi.1145/3477132.3483546
[26] Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, Biren Damani, Pol Mauri Ruiz, Vikas Mehta, 和 Chunqiang Tang. Shard Manager: 一种适用于地缘分布式应用的通用分片管理框架. 第 28 届 ACM SIGOPS 操作系统原理研讨会 (SOSP), 第 553-569 页, 2021 年 10 月. doi.1145/3477132.3483546

[27] Scott Lystig Fritchie.A Critique of Resizable Hash Tables: Riak Core & Random Slicing. infoq.com, August 2018. Archived at perma.cc/RPX7-7BLN
[27] Scott Lystig Fritchie. 可调整大小哈希表的批判:Riak Core & 随机切片. infoq.com, 2018 年 8 月. 永久存档于 perma.cc/RPX7-7BLN

[28] Andy Warfield.Building and operating a pretty big storage system called S3. allthingsdistributed.com, July 2023. Archived at perma.cc/6S7P-GLM4
[28] Andy Warfield. 构建和运营一个名为 S3 的庞大存储系统. allthingsdistributed.com, 2023 年 7 月. 永久存档于 perma.cc / 6S7P - GLM4

[29] Rich Houlihan.DynamoDB adaptive capacity: smooth performance for chaotic workloads (DAT327). At AWS re, November 2017.
[29] Rich Houlihan. DynamoDB 自适应容量:为混乱负载提供平稳性能 (DAT327). 在 AWS re 大会,2017 年 11 月.

[30] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze.Introduction to Information Retrieval. Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at nlp.stanford.edu/IR-book
[30] Christopher D. Manning, Prabhakar Raghavan, 和 Hinrich Schütze. 信息检索导论. Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, 可在线获取于 nlp.stanford.edu/IR-book

[31] Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin.Earlybird: Real-Time Search at Twitter. At 28th IEEE International Conference on Data Engineering (ICDE), April 2012.doi.1109/ICDE.2012.149
[31] Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill 和 Jimmy Lin. Earlybird: Twitter 的实时搜索. 在第 28 届 IEEE 国际数据工程会议 (ICDE), 2012 年 4 月. doi.1109/ICDE.2012.149

[32] Nadav Har’El.Indexing in Cassandra 3.github.com, April 2017. Archived at perma.cc/3ENV-8T9P
[32] Nadav Har’El. Cassandra 3 中的索引. github.com, 2017 年 4 月. 存档于 perma.cc / 3ENV - 8T9P

[33] Zachary Tong.Customizing Your Document Routing. elastic.co, June 2013. Archived at perma.cc/97VM-MREN
[33] Zachary Tong. 自定义文档路由. elastic.co, 2013 年 6 月. 存档于 perma.cc / 97VM - MREN

[34] Andrew Pavlo.H-Store Frequently Asked Questions.hstore.cs.brown.edu, October 2013. Archived at perma.cc/X3ZA-DW6Z
[34] Andrew Pavlo. H-Store 常见问题解答. hstore.cs.brown.edu, 2013 年 10 月. 存档于 perma.cc / X3ZA - DW6Z