9. Compressed Data Format (压缩数据格式)
官方英文原文 (Official English Text)
In this section, we describe the format of the compressed data set in terms of the format of the individual data items described in the previous sections.
9.1. Format of the Stream Header
The stream header has only the following one field:
1..7 bits: WBITS, a value in the range 10..24, encoded with the following variable-length code (as it appears in the compressed data, where the bits are parsed from right to left):
Value Bit Pattern
----- -----------
10 0100001
11 0110001
12 1000001
13 1010001
14 1100001
15 1110001
16 0
17 0000001
18 0011
19 0101
20 0111
21 1001
22 1011
23 1101
24 1111
Note that bit pattern 0010001 is invalid and must not be used.
The size of the sliding window, which is the maximum value of any non-dictionary reference backward distance, is given by the following formula:
window size = (1 << WBITS) - 16
9.2. Format of the Meta-Block Header
A compliant compressed data set has at least one meta-block. Each meta-block contains a header with information about the uncompressed length of the meta-block, and a bit signaling if the meta-block is the last one. The format of the meta-block header is the following:
1 bit: ISLAST, set to 1 if this is the last meta-block
1 bit: ISLASTEMPTY, if set to 1, the meta-block is empty; this field is only present if ISLAST bit is set -- if it is 1, then the meta-block and the brotli stream ends at that bit, with any remaining bits in the last byte of the compressed stream filled with zeros (if the fill bits are not zero, then the stream should be rejected as invalid)
2 bits: MNIBBLES, number of nibbles to represent the uncompressed length, encoded with the following fixed-length code:
Value Bit Pattern
----- -----------
0 11
4 00
5 01
6 10
If MNIBBLES is 0, the meta-block is empty, i.e., it does not generate any uncompressed data. In this case, the rest of the meta-block has the following format:
1 bit: reserved, must be zero
2 bits: MSKIPBYTES, number of bytes to represent metadata length
MSKIPBYTES * 8 bits: MSKIPLEN - 1, where MSKIPLEN is the number of
metadata bytes; this field is only present if MSKIPBYTES is
positive; otherwise, MSKIPLEN is 0 (if MSKIPBYTES is greater
than 1, and the last byte is all zeros, then the stream should
be rejected as invalid)
0..7 bits: fill bits until the next byte boundary, must be all zeros
MSKIPLEN bytes of metadata, not part of the uncompressed data or
the sliding window
MNIBBLES * 4 bits: MLEN - 1, where MLEN is the length of the meta-block uncompressed data in bytes (if MNIBBLES is greater than 4, and the last nibble is all zeros, then the stream should be rejected as invalid)
1 bit: ISUNCOMPRESSED, if set to 1, any bits of compressed data up to the next byte boundary are ignored, and the rest of the meta-block contains MLEN bytes of literal data; this field is only present if the ISLAST bit is not set (if the ignored bits are not all zeros, the stream should be rejected as invalid)
1..11 bits: NBLTYPESL, number of literal block types, encoded with the following variable-length code (as it appears in the compressed data, where the bits are parsed from right to left, so 0110111 has the value 12):
Value Bit Pattern
----- -----------
1 0
2 0001
3..4 x0011
5..8 xx0101
9..16 xxx0111
17..32 xxxx1001
33..64 xxxxx1011
65..128 xxxxxx1101
129..256 xxxxxxx1111
Prefix code over the block type code alphabet for literal block types, appears only if NBLTYPESL >= 2
Prefix code over the block count code alphabet for literal block counts, appears only if NBLTYPESL >= 2
Block count code + extra bits for first literal block count, appears only if NBLTYPESL >= 2
1..11 bits: NBLTYPESI, number of insert-and-copy block types, encoded with the same variable-length code as above
Prefix code over the block type code alphabet for insert-and-copy block types, appears only if NBLTYPESI >= 2
Prefix code over the block count code alphabet for insert-and-copy block counts, appears only if NBLTYPESI >= 2
Block count code + extra bits for first insert-and-copy block count, appears only if NBLTYPESI >= 2
1..11 bits: NBLTYPESD, number of distance block types, encoded with the same variable-length code as above
Prefix code over the block type code alphabet for distance block types, appears only if NBLTYPESD >= 2
Prefix code over the block count code alphabet for distance block counts, appears only if NBLTYPESD >= 2
Block count code + extra bits for first distance block count, appears only if NBLTYPESD >= 2
2 bits: NPOSTFIX, parameter used in the distance coding
4 bits: four most significant bits of NDIRECT, to get the actual value of the parameter NDIRECT, left-shift this four-bit number by NPOSTFIX bits
NBLTYPESL * 2 bits: context mode for each literal block type
1..11 bits: NTREESL, number of literal prefix trees, encoded with the same variable-length code as NBLTYPESL
Literal context map, encoded as described in Section 7.3, appears only if NTREESL >= 2; otherwise, the context map has only zero values
1..11 bits: NTREESD, number of distance prefix trees, encoded with the same variable-length code as NBLTYPESD
Distance context map, encoded as described in Section 7.3, appears only if NTREESD >= 2; otherwise, the context map has only zero values
NTREESL prefix codes for literals
NBLTYPESI prefix codes for insert-and-copy lengths
NTREESD prefix codes for distances
9.3. Format of the Meta-Block Data
The compressed data part of a meta-block consists of a series of commands. Each command has the following format:
Block type code for next insert-and-copy block type, appears only if NBLTYPESI >= 2 and the previous insert-and-copy block count is zero
Block count code + extra bits for next insert-and-copy block count, appears only if NBLTYPESI >= 2 and the previous insert-and-copy block count is zero
Insert-and-copy length, encoded as in Section 5, using the insert-and-copy length prefix code with the current insert-and-copy block type index
Insert length number of literals, with the following format:
-
Block type code for next literal block type, appears only if
NBLTYPESL >= 2and the previous literal block count is zero -
Block count code + extra bits for next literal block count, appears only if
NBLTYPESL >= 2and the previous literal block count is zero -
Next byte of the uncompressed data, encoded with the literal prefix code with the index determined by the previous two bytes of the uncompressed data, the current literal block type, and the context map, as described in Section 7.3
Block type code for next distance block type, appears only if NBLTYPESD >= 2 and the previous distance block count is zero
Block count code + extra bits for next distance block count, appears only if NBLTYPESD >= 2 and the previous distance block count is zero
Distance code, encoded as in Section 4, using the distance prefix code with the index determined by the copy length, the current distance block type, and the distance context map, as described in Section 7.3, appears only if the distance code is not an implicit 0, as indicated by the insert-and-copy length code
The number of commands in the meta-block is such that the sum of the uncompressed bytes produced (i.e., the number of literals inserted plus the number of bytes copied from past data or generated from the static dictionary) over all the commands gives the uncompressed length, MLEN encoded in the meta-block header.
If the total number of uncompressed bytes produced after the insert part of the last command equals MLEN, then the copy length of the last command is ignored and will not produce any uncompressed output. In this case, the copy length of the last command can have any value. In any other case, if the number of literals to insert, the copy length, or the resulting dictionary word length would cause MLEN to be exceeded, then the stream should be rejected as invalid.
If the last command of the last non-empty meta-block does not end on a byte boundary, the unused bits in the last byte must be zeros.
中文翻译 (Chinese Translation)
在本节中,我们根据前面各节中描述的各个数据项的格式来描述压缩数据集 (Compressed Data Set) 的格式.
9.1. Format of the Stream Header (流头部格式)
流头部 (Stream Header) 只有以下一个字段:
1..7 bits: WBITS, 范围 10..24 中的值,使用以下可变长度编码 (Variable-Length Code) 进行编码(如它在压缩数据中出现的那样,其中位从右到左解析):
Value Bit Pattern
----- -----------
10 0100001
11 0110001
12 1000001
13 1010001
14 1100001
15 1110001
16 0
17 0000001
18 0011
19 0101
20 0111
21 1001
22 1011
23 1101
24 1111
请注意,位模式 (Bit Pattern) 0010001 是无效的,不得使用.
滑动窗口 (Sliding Window) 的大小,即任何非字典引用后向距离 (Non-Dictionary Reference Backward Distance) 的最大值,由以下公式给出:
window size = (1 << WBITS) - 16
9.2. Format of the Meta-Block Header (元块头部格式)
符合标准的压缩数据集至少有一个元块 (Meta-Block)。每个元块都包含一个头部 (Header),其中包含有关元块的未压缩长度 (Uncompressed Length) 的信息,以及一个位信号指示元块是否是最后一个。元块头部的格式如下:
1 bit: ISLAST, 如果这是最后一个元块则设置为 1
1 bit: ISLASTEMPTY, 如果设置为 1,则元块为空; 此字段仅在 ISLAST 位设置时出现 -- 如果它为 1,则元块和 brotli 流在该位结束,压缩流的最后一个字节中的任何剩余位用零填充(如果填充位不为零,则应拒绝流为无效)
2 bits: MNIBBLES, 表示未压缩长度的半字节数 (Nibble),使用以下固定长度编码 (Fixed-Length Code) 进行编码:
Value Bit Pattern
----- -----------
0 11
4 00
5 01
6 10
如果 MNIBBLES 为 0,则元块为空,即它不生成任何未压缩数据。在这种情况下,元块的其余部分具有以下格式:
1 bit: 保留位 (Reserved Bit),必须为零
2 bits: MSKIPBYTES, 表示元数据长度 (Metadata Length) 的字节数
MSKIPBYTES * 8 bits: MSKIPLEN - 1, 其中 MSKIPLEN 是元数据字节数;
此字段仅在 MSKIPBYTES 为正时出现; 否则,MSKIPLEN 为 0
(如果 MSKIPBYTES 大于 1,并且最后一个字节全为零,
则应拒绝流为无效)
0..7 bits: 填充位 (Fill Bits) 直到下一个字节边界,必须全为零
MSKIPLEN 字节的元数据,不是未压缩数据或滑动窗口的一部分
MNIBBLES * 4 bits: MLEN - 1, 其中 MLEN 是元块未压缩数据的字节长度(如果 MNIBBLES 大于 4,并且最后一个半字节全为零,则应拒绝流为无效)
1 bit: ISUNCOMPRESSED, 如果设置为 1,则忽略到下一个字节边界的压缩数据的任何位,并且元块的其余部分包含 MLEN 字节的字面数据 (Literal Data); 此字段仅在 ISLAST 位未设置时出现(如果忽略的位不全为零,则应拒绝流为无效)
1..11 bits: NBLTYPESL, 字面块类型数量 (Number of Literal Block Types),使用以下可变长度编码进行编码(如它在压缩数据中出现的那样,其中位从右到左解析,因此 0110111 的值为 12):
Value Bit Pattern
----- -----------
1 0
2 0001
3..4 x0011
5..8 xx0101
9..16 xxx0111
17..32 xxxx1001
33..64 xxxxx1011
65..128 xxxxxx1101
129..256 xxxxxxx1111
字面块类型的块类型编码字母表 (Block Type Code Alphabet) 上的前缀编码,仅在 NBLTYPESL >= 2 时出现
字面块计数的块计数编码字母表 (Block Count Code Alphabet) 上的前缀编码,仅在 NBLTYPESL >= 2 时出现
第一个字面块计数的块计数编码 + 额外位 (Extra Bits),仅在 NBLTYPESL >= 2 时出现
1..11 bits: NBLTYPESI, 插入和复制块类型数量 (Number of Insert-and-Copy Block Types),使用与上述相同的可变长度编码进行编码
插入和复制块类型的块类型编码字母表上的前缀编码,仅在 NBLTYPESI >= 2 时出现
插入和复制块计数的块计数编码字母表上的前缀编码,仅在 NBLTYPESI >= 2 时出现
第一个插入和复制块计数的块计数编码 + 额外位,仅在 NBLTYPESI >= 2 时出现
1..11 bits: NBLTYPESD, 距离块类型数量 (Number of Distance Block Types),使用与上述相同的可变长度编码进行编码
距离块类型的块类型编码字母表上的前缀编码,仅在 NBLTYPESD >= 2 时出现
距离块计数的块计数编码字母表上的前缀编码,仅在 NBLTYPESD >= 2 时出现
第一个距离块计数的块计数编码 + 额外位,仅在 NBLTYPESD >= 2 时出现
2 bits: NPOSTFIX, 距离编码 (Distance Coding) 中使用的参数
4 bits: NDIRECT 的四个最高有效位 (Most Significant Bits), 要获得参数 NDIRECT 的实际值,将此四位数字左移 NPOSTFIX 位
NBLTYPESL * 2 bits: 每个字面块类型的上下文模式 (Context Mode)
1..11 bits: NTREESL, 字面前缀树数量 (Number of Literal Prefix Trees),使用与 NBLTYPESL 相同的可变长度编码进行编码
字面上下文映射 (Literal Context Map),如第 7.3 节所述编码,仅在 NTREESL >= 2 时出现; 否则,上下文映射只有零值
1..11 bits: NTREESD, 距离前缀树数量 (Number of Distance Prefix Trees),使用与 NBLTYPESD 相同的可变长度编码进行编码
距离上下文映射 (Distance Context Map),如第 7.3 节所述编码,仅在 NTREESD >= 2 时出现; 否则,上下文映射只有零值
字面量的 NTREESL 个前缀编码
插入和复制长度的 NBLTYPESI 个前缀编码
距离的 NTREESD 个前缀编码
9.3. Format of the Meta-Block Data (元块数据格式)
元块的压缩数据部分由一系列命令 (Commands) 组成。每个命令具有以下格式:
下一个插入和复制块类型的块类型编码, 仅在 NBLTYPESI >= 2 且前一个插入和复制块计数为零时出现
下一个插入和复制块计数的块计数编码 + 额外位, 仅在 NBLTYPESI >= 2 且前一个插入和复制块计数为零时出现
插入和复制长度 (Insert-and-Copy Length), 如第 5 节所述编码,使用具有当前插入和复制块类型索引的插入和复制长度前缀编码
插入长度个字面量 (Insert Length Number of Literals), 具有以下格式:
-
下一个字面块类型的块类型编码,仅在
NBLTYPESL >= 2且前一个字面块计数为零时出现 -
下一个字面块计数的块计数编码 + 额外位,仅在
NBLTYPESL >= 2且前一个字面块计数为零时出现 -
未压缩数据的下一个字节 (Next Byte of Uncompressed Data),使用由未压缩数据的前两个字节、当前字面块类型和上下文映射确定的索引的字面前缀编码进行编码,如第 7.3 节所述
下一个距离块类型的块类型编码, 仅在 NBLTYPESD >= 2 且前一个距离块计数为零时出现
下一个距离块计数的块计数编码 + 额外位, 仅在 NBLTYPESD >= 2 且前一个距离块计数为零时出现
距离编码 (Distance Code), 如第 4 节所述编码,使用由复制长度、当前距离块类型和距离上下文映射确定的索引的距离前缀编码,如第 7.3 节所述,仅在距离编码不是由插入和复制长度编码指示的隐式 0 时出现
元块中的命令数使得所有命令上生成的未压缩字节总和(即插入的字面量数加上从过去数据复制或从静态字典生成的字节数)给出元块头部中编码的未压缩长度 MLEN.
如果最后一个命令的插入部分之后生成的未压缩字节总数等于 MLEN,则忽略最后一个命令的复制长度,并且不会产生任何未压缩输出。在这种情况下,最后一个命令的复制长度可以具有任何值。在任何其他情况下,如果要插入的字面量数、复制长度或生成的字典词长度会导致超过 MLEN,则应拒绝流为无效 (Invalid).
如果最后一个非空元块的最后一个命令未在字节边界 (Byte Boundary) 上结束,则最后一个字节中的未使用位 (Unused Bits) 必须为零.
来源 (Source): RFC 7932, Section 9
官方文本 (Official Text): https://www.rfc-editor.org/rfc/rfc7932.txt