3. Compressed Representation of Prefix Codes (前缀编码的压缩表示)
官方英文原文 (Official English Text)
3.1. Introduction to Prefix Coding
Prefix coding represents symbols from an a priori known alphabet by bit sequences (codes), one code for each symbol, in a manner such that different symbols may be represented by bit sequences of different lengths, but a parser can always parse an encoded string unambiguously symbol-by-symbol.
We define a prefix code in terms of a binary tree in which the two edges descending from each non-leaf node are labeled 0 and 1, and in which the leaf nodes correspond one-for-one with (are labeled with) the symbols of the alphabet. The code for a symbol is the sequence of 0's and 1's on the edges leading from the root to the leaf labeled with that symbol. For example:
/\ Symbol Code
0 1 ------ ----
/ \ A 00
/\ B B 1
0 1 C 011
/ \ D 010
A /\
0 1
/ \
D C
A parser can decode the next symbol from the compressed stream by walking down the tree from the root, at each step choosing the edge corresponding to the next compressed data bit.
Given an alphabet with known symbol frequencies, the Huffman algorithm allows the construction of an optimal prefix code (one that represents strings with those symbol frequencies using the fewest bits of any possible prefix codes for that alphabet). Such a prefix code is called a Huffman code. (See [HUFFMAN] for additional information on Huffman codes.)
In the brotli format, note that the prefix codes for the various alphabets must not exceed certain maximum code lengths. This constraint complicates the algorithm for computing code lengths from symbol frequencies. Again, see [HUFFMAN] for details.
3.2. Use of Prefix Coding in the Brotli Format
The prefix codes used for each alphabet in the brotli format are canonical prefix codes, which have two additional rules:
-
All codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent;
-
Shorter codes lexicographically precede longer codes.
We could recode the example above to follow this rule as follows, assuming that the order of the alphabet is ABCD:
Symbol Code
------ ----
A 10
B 0
C 110
D 111
That is, 0 precedes 10, which precedes 11x, and 110 and 111 are lexicographically consecutive.
Given this rule, we can define the canonical prefix code for an alphabet just by giving the bit lengths of the codes for each symbol of the alphabet in order; this is sufficient to determine the actual codes. In our example, the code is completely defined by the sequence of bit lengths (2, 1, 3, 3). The following algorithm generates the codes as integers, intended to be read from most to least significant bit. The code lengths are initially in tree[I].Len; the codes are produced in tree[I].Code.
-
Count the number of codes for each code length. Let
bl_count[N]be the number of codes of length N, N >= 1. -
Find the numerical value of the smallest code for each code length:
code = 0;
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
code = (code + bl_count[bits-1]) << 1;
next_code[bits] = code;
}
- Assign numerical values to all codes, using consecutive values for all codes of the same length with the base values determined at step 2. Codes that are never used (which have a bit length of zero) must not be assigned a value.
for (n = 0; n <= max_code; n++) {
len = tree[n].Len;
if (len != 0) {
tree[n].Code = next_code[len];
next_code[len]++;
}
}
Example:
Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3, 3, 2, 4, 4). After step 1, we have:
N bl_count[N]
- -----------
2 1
3 5
4 2
Step 2 computes the following next_code values:
N next_code[N]
- ------------
1 0
2 0
3 2
4 14
Step 3 produces the following code values:
Symbol Length Code
------ ------ ----
A 3 010
B 3 011
C 3 100
D 3 101
E 3 110
F 2 00
G 4 1110
H 4 1111
3.3. Alphabet Sizes
Prefix codes are used for different purposes in the brotli format, and each purpose has a different alphabet size. For literal codes, the alphabet size is 256. For insert-and-copy length codes, the alphabet size is 704. For block count codes, the alphabet size is 26. For distance codes, block type codes, and the prefix codes used in compressing the context map, the alphabet size is dynamic and is based on parameters defined in later sections. The following table summarizes the alphabet sizes for the various prefix codes and the sections of this document in which they are defined.
+-----------------+-------------------------+------------+
| Prefix Code | Alphabet Size | Definition |
+-----------------+-------------------------+------------+
| literal | 256 | |
+-----------------+-------------------------+------------+
| distance | 16 + NDIRECT + | Section 4 |
| | (48 << NPOSTFIX) | |
+-----------------+-------------------------+------------+
| insert-and-copy | 704 | Section 5 |
| length | | |
+-----------------+-------------------------+------------+
| block count | 26 | Section 6 |
+-----------------+-------------------------+------------+
| block type | NBLTYPESx + 2, | Section 6 |
| | (where x is I, L, or D) | |
+-----------------+-------------------------+------------+
| context map | NTREESx + RLEMAXx | Section 7 |
| | (where x is L or D) | |
+-----------------+-------------------------+------------+
3.4. Simple Prefix Codes
The first two bits of the compressed representation of each prefix code distinguish between simple and complex prefix codes. If this value is 1, then a simple prefix code follows as described in this section. Otherwise, a complex prefix code follows as described in Section 3.5.
A simple prefix code can have up to four symbols with non-zero code length. The format of the simple prefix code is as follows:
2 bits: value of 1 indicates a simple prefix code
2 bits: NSYM - 1, where NSYM = number of symbols coded
NSYM symbols, each encoded using ALPHABET_BITS bits
1 bit: tree-select, present only for NSYM = 4
The value of ALPHABET_BITS depends on the alphabet of the prefix code: it is the smallest number of bits that can represent all symbols in the alphabet. For example, for the alphabet of literal bytes, ALPHABET_BITS is 8. The value of each of the NSYM symbols above is the value of the ALPHABET_BITS width integer value. If the integer value is greater than or equal to the alphabet size, or the value is identical to a previous value, then the stream should be rejected as invalid.
Note that the NSYM symbols may not be presented in sorted order. Prefix codes of the same bit length must be assigned to the symbols in sorted order.
The (non-zero) code lengths of the symbols can be reconstructed as follows:
-
if
NSYM= 1, the code length for the one symbol is zero -- when encoding this symbol in the compressed data stream using this prefix code, no actual bits are emitted. Similarly, when decoding a symbol using this prefix code, no bits are read and the one symbol is returned. -
if
NSYM= 2, both symbols have code length 1. -
if
NSYM= 3, the code lengths for the symbols are 1, 2, 2 in the order they appear in the representation of the simple prefix code. -
if
NSYM= 4, the code lengths (in order of symbols decoded) depend on the tree-select bit: 2, 2, 2, 2 (tree-select bit 0), or 1, 2, 3, 3 (tree-select bit 1).
3.5. Complex Prefix Codes
A complex prefix code is a canonical prefix code, defined by the sequence of code lengths, as discussed in Section 3.2. For even greater compactness, the code length sequences themselves are compressed using a prefix code. The alphabet for code lengths is as follows:
0..15: Represent code lengths of 0..15
16: Copy the previous non-zero code length 3..6 times.
The next 2 bits indicate repeat length
(0 = 3, ... , 3 = 6)
If this is the first code length, or all previous
code lengths are zero, a code length of 8 is
repeated 3..6 times.
A repeated code length code of 16 modifies the
repeat count of the previous one as follows:
repeat count = (4 * (repeat count - 2)) +
(3..6 on the next 2 bits)
Example: Codes 7, 16 (+2 bits 11), 16 (+2 bits 10)
will expand to 22 code lengths of 7
(1 + 4 * (6 - 2) + 5)
17: Repeat a code length of 0 for 3..10 times.
The next 3 bits indicate repeat length
(0 = 3, ... , 7 = 10)
A repeated code length code of 17 modifies the
repeat count of the previous one as follows:
repeat count = (8 * (repeat count - 2)) +
(3..10 on the next 3 bits)
Note that a code of 16 that follows an immediately preceding 16 modifies the previous repeat count, which becomes the new repeat count. The same is true for a 17 following a 17. A sequence of three or more 16 codes in a row or three of more 17 codes in a row is possible, modifying the count each time. Only the final repeat count is used. The modification only applies if the same code follows. A 16 repeat does not modify an immediately preceding 17 count nor vice versa.
A code length of 0 indicates that the corresponding symbol in the alphabet will not occur in the compressed data, and it should not participate in the prefix code construction algorithm given earlier. A complex prefix code must have at least two non-zero code lengths.
The bit lengths of the prefix code over the code length alphabet are compressed with the following variable-length code (as it appears in the compressed data, where the bits are parsed from right to left):
Symbol Code
------ ----
0 00
1 0111
2 011
3 10
4 01
5 1111
We can now define the format of the complex prefix code as follows:
-
2 bits:
HSKIP, the number of skipped code lengths, can have values of 0, 2, or 3. The skipped lengths are taken to be zero. (AnHSKIPof 1 indicates a Simple prefix code.) -
Code lengths for symbols in the code length alphabet given just above, in the order: 1, 2, 3, 4, 0, 5, 17, 6, 16, 7, 8, 9, 10, 11, 12, 13, 14, 15. If
HSKIPis 2, then the code lengths for symbols 1 and 2 are zero, and the first code length is for symbol 3. IfHSKIPis 3, then the code length for symbol 3 is also zero, and the first code length is for symbol 4.The code lengths of code length symbols are between 0 and 5, and they are represented with 2..4 bits according to the variable-length code above. A code length of 0 means the corresponding code length symbol is not used.
If
HSKIPis 2 or 3, a respective number of leading code lengths are implicit zeros and are not present in the code length sequence above.If there are at least two non-zero code lengths, any trailing zero code lengths are omitted, i.e., the last code length in the sequence must be non-zero. In this case, the sum of (32 >> code length) over all the non-zero code lengths must equal to 32.
If the lengths have been read for the entire code length alphabet and there was only one non-zero code length, then the prefix code has one symbol whose code has zero length. In this case, that symbol results in no bits being emitted by the compressor and no bits consumed by the decompressor. That single symbol is immediately returned when this code is decoded. An example of where this occurs is if the entire code to be represented has symbols of length 8. For example, a literal code that represents all literal values with equal probability. In this case the single symbol is 16, which repeats the previous length. The previous length is taken to be 8 before any code length code lengths are read.
-
Sequence of code length symbols, which is at most the size of the alphabet, encoded using the code length prefix code. Any trailing 0 or 17 must be omitted, i.e., the last encoded code length symbol must be between 1 and 16. The sum of (32768 >> code length) over all the non-zero code lengths in the alphabet, including those encoded using repeat code(s) of 16, must be equal to 32768. If the number of times to repeat the previous length or repeat a zero length would result in more lengths in total than the number of symbols in the alphabet, then the stream should be rejected as invalid.
中文翻译 (Chinese Translation)
3.1. Introduction to Prefix Coding (前缀编码简介)
前缀编码 (Prefix Coding) 用位序列(编码)从先验已知的字母表中表示符号,每个符号一个编码,其方式使得不同的符号可以由不同长度的位序列表示,但解析器总是可以逐个符号地明确解析编码字符串.
我们根据二叉树 (Binary Tree) 定义前缀编码,其中从每个非叶节点 (Non-Leaf Node) 下降的两条边分别标记为 0 和 1,并且其中的叶节点 (Leaf Node) 与字母表的符号一一对应(用符号标记)。符号的编码是从根到用该符号标记的叶的边上的 0 和 1 的序列。例如:
/\ Symbol Code
0 1 ------ ----
/ \ A 00
/\ B B 1
0 1 C 011
/ \ D 010
A /\
0 1
/ \
D C
解析器可以通过从根开始沿树向下走来从压缩流中解码下一个符号,在每一步选择与下一个压缩数据位对应的边.
给定具有已知符号频率 (Symbol Frequency) 的字母表,Huffman 算法允许构建最优前缀编码(使用尽可能少的位数表示具有这些符号频率的字符串,相对于该字母表的任何可能的前缀编码)。这种前缀编码称为 Huffman 编码 (Huffman Code)。(有关 Huffman 编码的其他信息,请参见 [HUFFMAN]。)
在 brotli 格式中,请注意各种字母表的前缀编码不得超过某些最大编码长度 (Maximum Code Length)。此约束使从符号频率计算编码长度的算法复杂化。同样,有关详细信息,请参见 [HUFFMAN].
3.2. Use of Prefix Coding in the Brotli Format (Brotli 格式中的前缀编码使用)
brotli 格式中用于每个字母表的前缀编码是规范前缀编码 (Canonical Prefix Code),它有两个附加规则:
-
给定位长度的所有编码具有字典顺序连续的值 (Lexicographically Consecutive Values),其顺序与它们表示的符号相同;
-
较短的编码在字典顺序上先于较长的编码.
我们可以按照此规则重新编码上面的示例,假设字母表的顺序是 ABCD:
Symbol Code
------ ----
A 10
B 0
C 110
D 111
也就是说,0 先于 10,10 先于 11x,并且 110 和 111 在字典顺序上是连续的.
给定此规则,我们可以仅通过给出字母表中每个符号的编码的位长度 (Bit Length) 来定义字母表的规范前缀编码; 这足以确定实际的编码。在我们的示例中,编码完全由位长度序列 (2, 1, 3, 3) 定义。以下算法生成作为整数的编码,旨在从最高有效位 (Most Significant Bit) 到最低有效位 (Least Significant Bit) 读取。编码长度最初在 tree[I].Len 中; 编码在 tree[I].Code 中生成.
步骤 1) 统计每个编码长度的编码数量。设 bl_count[N] 为长度 N 的编码数量,N >= 1.
步骤 2) 找到每个编码长度的最小编码的数值:
code = 0;
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
code = (code + bl_count[bits-1]) << 1;
next_code[bits] = code;
}
步骤 3) 为所有编码分配数值,对相同长度的所有编码使用连续值,基本值在步骤 2 中确定。从不使用的编码(位长度为零)不得分配值.
for (n = 0; n <= max_code; n++) {
len = tree[n].Len;
if (len != 0) {
tree[n].Code = next_code[len];
next_code[len]++;
}
}
示例:
考虑字母表 ABCDEFGH,位长度为 (3, 3, 3, 3, 3, 2, 4, 4)。步骤 1 后,我们有:
N bl_count[N]
- -----------
2 1
3 5
4 2
步骤 2 计算以下 next_code 值:
N next_code[N]
- ------------
1 0
2 0
3 2
4 14
步骤 3 生成以下编码值:
Symbol Length Code
------ ------ ----
A 3 010
B 3 011
C 3 100
D 3 101
E 3 110
F 2 00
G 4 1110
H 4 1111
3.3. Alphabet Sizes (字母表大小)
前缀编码在 brotli 格式中用于不同的目的,每个目的都有不同的字母表大小 (Alphabet Size)。对于字面编码 (Literal Code),字母表大小为 256。对于插入和复制长度编码 (Insert-and-Copy Length Code),字母表大小为 704。对于块计数编码 (Block Count Code),字母表大小为 26。对于距离编码 (Distance Code)、块类型编码 (Block Type Code) 和压缩上下文映射中使用的前缀编码,字母表大小是动态的,并基于后续章节中定义的参数。下表总结了各种前缀编码的字母表大小以及定义它们的本文档章节.
| Prefix Code (前缀编码) | Alphabet Size (字母表大小) | Definition (定义) |
|---|---|---|
| literal (字面量) | 256 | |
| distance (距离) | 16 + NDIRECT + (48 << NPOSTFIX) | Section 4 |
| insert-and-copy length (插入和复制长度) | 704 | Section 5 |
| block count (块计数) | 26 | Section 6 |
| block type (块类型) | NBLTYPESx + 2 (x 是 I, L, 或 D) | Section 6 |
| context map (上下文映射) | NTREESx + RLEMAXx (x 是 L 或 D) | Section 7 |
3.4. Simple Prefix Codes (简单前缀编码)
每个前缀编码的压缩表示的前两位区分简单前缀编码 (Simple Prefix Code) 和复杂前缀编码 (Complex Prefix Code)。如果此值为 1,则如本节所述,随后是简单前缀编码。否则,如第 3.5 节所述,随后是复杂前缀编码.
简单前缀编码最多可以有四个具有非零编码长度的符号。简单前缀编码的格式如下:
2 bits: 值为 1 表示简单前缀编码
2 bits: NSYM - 1, 其中 NSYM = 编码的符号数量
NSYM 个符号, 每个使用 ALPHABET_BITS 位编码
1 bit: tree-select, 仅在 NSYM = 4 时出现
ALPHABET_BITS 的值取决于前缀编码的字母表: 它是可以表示字母表中所有符号的最小位数。例如,对于字面字节的字母表,ALPHABET_BITS 是 8。上面每个 NSYM 符号的值是 ALPHABET_BITS 宽度整数值的值。如果整数值大于或等于字母表大小,或者该值与之前的值相同,则应拒绝流为无效 (Invalid).
请注意,NSYM 符号可能不按排序顺序呈现。相同位长度的前缀编码必须按排序顺序分配给符号.
符号的(非零)编码长度可以如下重建:
-
如果
NSYM= 1,则一个符号的编码长度为零 -- 在使用此前缀编码对压缩数据流中的此符号进行编码时,不会发出实际位。类似地,在使用此前缀编码解码符号时,不会读取位,并返回一个符号. -
如果
NSYM= 2,则两个符号的编码长度均为 1. -
如果
NSYM= 3,则符号的编码长度按它们在简单前缀编码表示中出现的顺序为1, 2, 2. -
如果
NSYM= 4,则编码长度(按符号解码顺序)取决于tree-select位:2, 2, 2, 2(tree-select 位 0),或1, 2, 3, 3(tree-select 位 1).
3.5. Complex Prefix Codes (复杂前缀编码)
复杂前缀编码是规范前缀编码,如第 3.2 节所述,由编码长度序列 (Code Length Sequence) 定义。为了更大的紧凑性,编码长度序列本身使用前缀编码进行压缩。编码长度的字母表如下:
0..15: 表示编码长度 0..15
16: 复制前一个非零编码长度 3..6 次.
下 2 位表示重复长度
(0 = 3, ... , 3 = 6)
如果这是第一个编码长度,或所有先前的
编码长度为零,则编码长度 8 被
重复 3..6 次.
重复的编码长度编码 16 修改
前一个的重复计数,如下所示:
repeat count = (4 * (repeat count - 2)) +
(下 2 位上的 3..6)
示例: 编码 7, 16 (+2 位 11), 16 (+2 位 10)
将扩展为 22 个长度为 7 的编码长度
(1 + 4 * (6 - 2) + 5)
17: 重复编码长度 0 3..10 次.
下 3 位表示重复长度
(0 = 3, ... , 7 = 10)
重复的编码长度编码 17 修改
前一个的重复计数,如下所示:
repeat count = (8 * (repeat count - 2)) +
(下 3 位上的 3..10)
请注意,紧接在前一个 16 之后的 16 编码会修改前一个重复计数 (Repeat Count),该计数成为新的重复计数。对于紧接在 17 之后的 17,情况也是如此。连续三个或更多 16 编码或连续三个或更多 17 编码是可能的,每次都会修改计数。仅使用最终的重复计数。修改仅在相同编码后适用。16 重复不会修改紧接在前面的 17 计数,反之亦然.
编码长度为 0 表示字母表中的相应符号不会出现在压缩数据中,并且它不应参与先前给出的前缀编码构造算法。复杂前缀编码必须至少有两个非零编码长度.
编码长度字母表 (Code Length Alphabet) 上的前缀编码的位长度使用以下可变长度编码 (Variable-Length Code) 进行压缩(如它在压缩数据中出现的那样,其中位从右到左解析):
Symbol Code
------ ----
0 00
1 0111
2 011
3 10
4 01
5 1111
我们现在可以如下定义复杂前缀编码的格式:
-
2 bits:
HSKIP,跳过的编码长度数量 (Number of Skipped Code Lengths),可以具有 0、2 或 3 的值。跳过的长度被视为零。(HSKIP为 1 表示简单前缀编码。) -
编码长度字母表中符号的编码长度,按顺序给出:
1, 2, 3, 4, 0, 5, 17, 6, 16, 7, 8, 9, 10, 11, 12, 13, 14, 15。如果HSKIP为 2,则符号 1 和 2 的编码长度为零,第一个编码长度用于符号 3。如果HSKIP为 3,则符号 3 的编码长度也为零,第一个编码长度用于符号 4.编码长度符号的编码长度在 0 和 5 之间,并且它们根据上面的可变长度编码用 2..4 位表示。编码长度为 0 表示不使用相应的编码长度符号.
如果
HSKIP为 2 或 3,则相应数量的前导编码长度是隐式零,不出现在上面的编码长度序列中.如果至少有两个非零编码长度,则省略任何尾随零编码长度,即序列中的最后一个编码长度必须为非零。在这种情况下,所有非零编码长度上的
(32 >> code length)之和必须等于 32.如果已为整个编码长度字母表读取了长度,并且只有一个非零编码长度,则前缀编码有一个其编码长度为零的符号。在这种情况下,该符号导致压缩器不发出位,解压缩器不消耗位。解码此编码时立即返回该单个符号。一个例子是如果要表示的整个编码具有长度为 8 的符号。例如,以相等概率表示所有字面值的字面编码。在这种情况下,单个符号是 16,它重复前一个长度。在读取任何编码长度编码长度之前,前一个长度被视为 8.
-
编码长度符号序列 (Sequence of Code Length Symbols),最多为字母表的大小,使用编码长度前缀编码进行编码。必须省略任何尾随 0 或 17,即最后一个编码的编码长度符号必须在 1 和 16 之间。字母表中所有非零编码长度(包括使用 16 的重复编码编码的那些)上的
(32768 >> code length)之和必须等于 32768。如果重复前一个长度或重复零长度的次数总共导致的长度数超过字母表中的符号数,则应拒绝流为无效.
来源 (Source): RFC 7932, Section 3
官方文本 (Official Text): https://www.rfc-editor.org/rfc/rfc7932.txt