4. Encoding of Distances (距离编码)
官方英文原文 (Official English Text)
As described in Section 2, one component of a compressed meta-block is a sequence of backward distances. In this section, we provide the details to the encoding of distances.
Each distance in the compressed data part of a meta-block is represented with a pair <distance code, extra bits>. The distance code and the extra bits are encoded back-to-back, the distance code is encoded using a prefix code over the distance alphabet, while the extra bits value is encoded as a fixed-width integer value. The number of extra bits can be 0..24, and it is dependent on the distance code.
To convert a distance code and associated extra bits to a backward distance, we need the sequence of past distances and two additional parameters: the number of "postfix bits", denoted by NPOSTFIX (0..3), and the number of direct distance codes, denoted by NDIRECT (0..120). Both of these parameters are encoded in the meta-block header. We will also use the following derived parameter:
POSTFIX_MASK = (1 << NPOSTFIX) - 1
The first 16 distance symbols are special symbols that reference past distances as follows:
0: last distance
1: second-to-last distance
2: third-to-last distance
3: fourth-to-last distance
4: last distance - 1
5: last distance + 1
6: last distance - 2
7: last distance + 2
8: last distance - 3
9: last distance + 3
10: second-to-last distance - 1
11: second-to-last distance + 1
12: second-to-last distance - 2
13: second-to-last distance + 2
14: second-to-last distance - 3
15: second-to-last distance + 3
The ring buffer of the four last distances is initialized by the values 16, 15, 11, and 4 (i.e., the fourth-to-last is set to 16, the third-to-last to 15, the second-to-last to 11, and the last distance to 4) at the beginning of the stream (as opposed to the beginning of the meta-block), and it is not reset at meta-block boundaries. When a distance symbol 0 appears, the distance it represents (i.e., the last distance in the sequence of distances) is not pushed to the ring buffer of last distances; in other words, the expression "second-to-last distance" means the second-to-last distance that was not represented by a 0 distance symbol (and similar for "third-to-last distance" and "fourth-to-last distance"). Similarly, distances that represent static dictionary words (see Section 8) are not pushed to the ring buffer of last distances.
If a special distance symbol resolves to a zero or negative value, the stream should be rejected as invalid.
If NDIRECT is greater than zero, then the next NDIRECT distance symbols, from 16 to 15 + NDIRECT, represent distances from 1 to NDIRECT. Neither the special distance symbols nor the NDIRECT direct distance symbols are followed by any extra bits.
Distance symbols 16 + NDIRECT and greater all have extra bits, where the number of extra bits for a distance symbol "dcode" is given by the following formula:
ndistbits = 1 + ((dcode - NDIRECT - 16) >> (NPOSTFIX + 1))
The maximum number of extra bits is 24; therefore, the size of the distance symbol alphabet is (16 + NDIRECT + (48 << NPOSTFIX)).
Given a distance symbol "dcode" (>= 16 + NDIRECT), and extra bits "dextra", the backward distance is given by the following formula:
hcode = (dcode - NDIRECT - 16) >> NPOSTFIX
lcode = (dcode - NDIRECT - 16) & POSTFIX_MASK
offset = ((2 + (hcode & 1)) << ndistbits) - 4
distance = ((offset + dextra) << NPOSTFIX) + lcode + NDIRECT + 1
中文翻译 (Chinese Translation)
如第 2 节所述,压缩元块 (Compressed Meta-Block) 的一个组成部分是后向距离 (Backward Distance) 序列。在本节中,我们提供距离编码的详细信息.
元块压缩数据部分中的每个距离用一对 <distance code, extra bits> (距离编码,额外位) 表示。距离编码和额外位背靠背编码 (Encoded Back-to-Back),距离编码使用距离字母表 (Distance Alphabet) 上的前缀编码进行编码,而额外位值 (Extra Bits Value) 编码为固定宽度整数值 (Fixed-Width Integer Value)。额外位的数量可以是 0..24,并且取决于距离编码.
要将距离编码和相关的额外位转换为后向距离,我们需要过去距离的序列 (Sequence of Past Distances) 以及两个附加参数: "后缀位" (Postfix Bits) 的数量,用 NPOSTFIX (0..3) 表示,以及直接距离编码 (Direct Distance Code) 的数量,用 NDIRECT (0..120) 表示。这两个参数都在元块头部中编码。我们还将使用以下派生参数 (Derived Parameter):
POSTFIX_MASK = (1 << NPOSTFIX) - 1
前 16 个距离符号 (Distance Symbol) 是特殊符号 (Special Symbol),它们按如下方式引用过去的距离:
0: 最后一个距离 (last distance)
1: 倒数第二个距离 (second-to-last distance)
2: 倒数第三个距离 (third-to-last distance)
3: 倒数第四个距离 (fourth-to-last distance)
4: 最后一个距离 - 1
5: 最后一个距离 + 1
6: 最后一个距离 - 2
7: 最后一个距离 + 2
8: 最后一个距离 - 3
9: 最后一个距离 + 3
10: 倒数第二个距离 - 1
11: 倒数第二个距离 + 1
12: 倒数第二个距离 - 2
13: 倒数第二个距离 + 2
14: 倒数第二个距离 - 3
15: 倒数第二个距离 + 3
四个最后距离的环形缓冲区 (Ring Buffer) 在流 (Stream) 开始时(而不是元块开始时)由值 16, 15, 11, 4 初始化(即,倒数第四个设置为 16,倒数第三个设置为 15,倒数第二个设置为 11,最后一个距离设置为 4),并且它不在元块边界 (Meta-Block Boundary) 处重置。当出现距离符号 0 时,它表示的距离(即,距离序列中的最后一个距离)不会被推送到最后距离的环形缓冲区; 换句话说,表达式"倒数第二个距离"表示倒数第二个不由距离符号 0 表示的距离("倒数第三个距离"和"倒数第四个距离"也是如此)。类似地,表示静态字典词 (Static Dictionary Word)(参见第 8 节)的距离不会被推送到最后距离的环形缓冲区.
如果特殊距离符号解析为零或负值,则应拒绝流为无效 (Invalid).
如果 NDIRECT 大于零,则接下来的 NDIRECT 个距离符号,从 16 到 15 + NDIRECT,表示从 1 到 NDIRECT 的距离。特殊距离符号和 NDIRECT 个直接距离符号后面都不跟任何额外位.
距离符号 16 + NDIRECT 及更大的符号都有额外位,其中距离符号 "dcode" 的额外位数 (Number of Extra Bits) 由以下公式给出:
ndistbits = 1 + ((dcode - NDIRECT - 16) >> (NPOSTFIX + 1))
额外位的最大数量为 24; 因此,距离符号字母表的大小 (Size of Distance Symbol Alphabet) 为 (16 + NDIRECT + (48 << NPOSTFIX)).
给定距离符号 "dcode" (>= 16 + NDIRECT) 和额外位 "dextra",后向距离 (Backward Distance) 由以下公式给出:
hcode = (dcode - NDIRECT - 16) >> NPOSTFIX
lcode = (dcode - NDIRECT - 16) & POSTFIX_MASK
offset = ((2 + (hcode & 1)) << ndistbits) - 4
distance = ((offset + dextra) << NPOSTFIX) + lcode + NDIRECT + 1
关键概念说明
环形缓冲区机制:
- 维护最近 4 个距离值
- 初始值:
[16, 15, 11, 4] - 在整个流中持久化,不在元块边界重置
特殊符号处理:
- 符号 0-15: 引用环形缓冲区中的历史距离
- 符号 0 和静态字典距离不更新环形缓冲区
距离编码结构:
- 符号 0-15: 特殊引用符号(无额外位)
- 符号 16 到
15 + NDIRECT: 直接距离值(无额外位) - 符号
16 + NDIRECT及以上: 需要额外位的编码距离
来源 (Source): RFC 7932, Section 4
官方文本 (Official Text): https://www.rfc-editor.org/rfc/rfc7932.txt