Skip to main content

7. Context Modeling (上下文建模)

官方英文原文 (Official English Text)

As described in Section 2, the prefix tree used to encode a literal byte or a distance code depends on the block type and the context ID. This section specifies how to compute the context ID for a particular literal and distance code and how to encode the context map that maps a <block type, context ID> pair to the index of a prefix code in the array of literal and distance prefix codes.

7.1. Context Modes and Context ID Lookup for Literals

The context for encoding the next literal is defined by the last two bytes in the stream (p1, p2, where p1 is the most recent byte), regardless of whether these bytes are produced by uncompressed meta-blocks, backward references, static dictionary references, or by literal insertions. At the start of the stream, p1 and p2 are initialized to zero.

There are four methods, called context modes, to compute the Context ID:

  • LSB6, where the Context ID is the value of six least significant bits of p1,

  • MSB6, where the Context ID is the value of six most significant bits of p1,

  • UTF8, where the Context ID is a complex function of p1, p2, optimized for text compression, and

  • Signed, where Context ID is a complex function of p1, p2, optimized for compressing sequences of signed integers.

The Context ID for the UTF8 and Signed context modes is computed using the following lookup tables Lut0, Lut1, and Lut2.

Lut0 :=
0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 4, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
8, 12, 16, 12, 12, 20, 12, 16, 24, 28, 12, 12, 32, 12, 36, 12,
44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 32, 32, 24, 40, 28, 12,
12, 48, 52, 52, 52, 48, 52, 52, 52, 48, 52, 52, 52, 52, 52, 48,
52, 52, 52, 52, 52, 48, 52, 52, 52, 52, 52, 24, 12, 28, 12, 12,
12, 56, 60, 60, 60, 56, 60, 60, 60, 56, 60, 60, 60, 60, 60, 56,
60, 60, 60, 60, 60, 56, 60, 60, 60, 60, 60, 24, 12, 28, 12, 0,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3

Lut1 :=
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2

Lut2 :=
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7

The lengths and the CRC-32 check values (see Appendix C) of each of these tables as a sequence of bytes are as follows:

Table    Length    CRC-32
----- ------ ------
Lut0 256 0x8e91efb7
Lut1 256 0xd01a32f4
Lut2 256 0x0dd7a0d6

Given p1 is the last uncompressed byte and p2 is the second-to-last uncompressed byte, the context IDs can be computed as follows:

For LSB6:    Context ID = p1 & 0x3f
For MSB6: Context ID = p1 >> 2
For UTF8: Context ID = Lut0[p1] | Lut1[p2]
For Signed: Context ID = (Lut2[p1] << 3) | Lut2[p2]

From the lookup tables defined above and the operations to compute the context IDs, we can see that context IDs for literals are in the range of 0..63.

The context modes LSB6, MSB6, UTF8, and Signed are denoted by integers 0, 1, 2, 3.

A context mode is defined for each literal block type and they are stored in a consecutive array of bits in the meta-block header, always two bits per block type.

7.2. Context ID for Distances

The context for encoding a distance code is defined by the copy length corresponding to the distance. The context IDs are 0, 1, 2, and 3 for copy lengths 2, 3, 4, and more than 4, respectively.

7.3. Encoding of the Context Map

There are two context maps, one for literals and one for distances. The size of the context map is 64 * NBLTYPESL for literals, and 4 * NBLTYPESD for distances. Each value in the context map is an integer between 0 and 255, indicating the index of the prefix code to be used when encoding the next literal or distance.

The context maps are two-dimensional matrices, encoded as one-dimensional arrays:

CMAPL[0..(64 * NBLTYPESL - 1)]
CMAPD[0..(4 * NBLTYPESD - 1)]

The index of the prefix code for encoding a literal or distance code with block type, BTYPE_x, and context ID, CIDx, is:

index of literal prefix code = CMAPL[64 * BTYPE_L + CIDL]
index of distance prefix code = CMAPD[4 * BTYPE_D + CIDD]

The values of the context map are encoded with the combination of run length encoding for zero values and prefix coding. Let RLEMAX denote the number of run length codes and NTREES denote the maximum value in the context map plus one. NTREES must equal the number of different values in the context map; in other words, the different values in the context map must be the [0..NTREES-1] interval. The alphabet of the prefix code has the following RLEMAX + NTREES symbols:

0: value zero
1: repeat a zero 2 to 3 times, read 1 bit for repeat length
2: repeat a zero 4 to 7 times, read 2 bits for repeat length
...
RLEMAX: repeat a zero (1 << RLEMAX) to (1 << (RLEMAX+1))-1
times, read RLEMAX bits for repeat length
RLEMAX + 1: value 1
...
RLEMAX + NTREES - 1: value NTREES - 1

If RLEMAX = 0, the run length coding is not used and the symbols of the alphabet are directly the values in the context map. We can now define the format of the context map (the same format is used for literal and distance context maps):

  • 1..5 bits: RLEMAX, 0 is encoded with one 0 bit, and values 1..16 are encoded with bit pattern xxxx1 (so 01001 is 5)

  • Prefix code with alphabet size NTREES + RLEMAX

  • Context map size values encoded with the above prefix code and run length coding for zero values. If a run length would result in more lengths in total than the size of the context map, then the stream should be rejected as invalid.

  • 1 bit: IMTF bit, if set, we do an inverse move-to-front transform on the values in the context map to get the prefix code indexes

Note that RLEMAX may be larger than what is necessary to represent the longest run of zeros. Additionally, the NTREES value is encoded before the context map, as described in Section 9.2.

We define the inverse move-to-front transform used in this specification by the following C language function:

void InverseMoveToFrontTransform(uint8_t* v, int v_len) {
uint8_t mtf[256];
int i;
for (i = 0; i < 256; ++i) {
mtf[i] = (uint8_t)i;
}
for (i = 0; i < v_len; ++i) {
uint8_t index = v[i];
uint8_t value = mtf[index];
v[i] = value;
for (; index; --index) {
mtf[index] = mtf[index - 1];
}
mtf[0] = value;
}
}

Note that the inverse move-to-front transform will not produce values outside the [0..NTREES-1] interval.


中文翻译 (Chinese Translation)

如第 2 节所述,用于编码字面字节 (Literal Byte) 或距离编码 (Distance Code) 的前缀树 (Prefix Tree) 取决于块类型 (Block Type) 和上下文 ID (Context ID)。本节指定如何计算特定字面量和距离编码的上下文 ID,以及如何编码上下文映射 (Context Map),该映射将 <block type, context ID> 对映射到字面量和距离前缀编码数组中的前缀编码的索引.

7.1. Context Modes and Context ID Lookup for Literals (字面量的上下文模式和上下文 ID 查找)

用于编码下一个字面量的上下文 (Context) 由流中的最后两个字节 (p1, p2, 其中 p1 是最近的字节) 定义,无论这些字节是由未压缩元块 (Uncompressed Meta-Block)、后向引用 (Backward Reference)、静态字典引用 (Static Dictionary Reference) 还是字面插入 (Literal Insertion) 产生的。在流开始时,p1p2 初始化为零.

有四种方法,称为上下文模式 (Context Mode),用于计算上下文 ID:

  • LSB6: 上下文 ID 是 p1 的六个最低有效位 (Least Significant Bits) 的值,

  • MSB6: 上下文 ID 是 p1 的六个最高有效位 (Most Significant Bits) 的值,

  • UTF8: 上下文 ID 是 p1p2 的复杂函数 (Complex Function),针对文本压缩 (Text Compression) 进行了优化,以及

  • Signed: 上下文 ID 是 p1p2 的复杂函数,针对压缩有符号整数序列 (Sequences of Signed Integers) 进行了优化.

UTF8 和 Signed 上下文模式的上下文 ID 使用以下查找表 (Lookup Table) Lut0Lut1Lut2 计算.

Lut0 (256 字节,CRC-32: 0x8e91efb7):

  0,  0,  0,  0,  0,  0,  0,  0,  0,  4,  4,  0,  0,  4,  0,  0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
8, 12, 16, 12, 12, 20, 12, 16, 24, 28, 12, 12, 32, 12, 36, 12,
44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 32, 32, 24, 40, 28, 12,
12, 48, 52, 52, 52, 48, 52, 52, 52, 48, 52, 52, 52, 52, 52, 48,
52, 52, 52, 52, 52, 48, 52, 52, 52, 52, 52, 24, 12, 28, 12, 12,
12, 56, 60, 60, 60, 56, 60, 60, 60, 56, 60, 60, 60, 60, 60, 56,
60, 60, 60, 60, 60, 56, 60, 60, 60, 60, 60, 24, 12, 28, 12, 0,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3

Lut1 (256 字节,CRC-32: 0xd01a32f4):

  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2

Lut2 (256 字节,CRC-32: 0x0dd7a0d6):

  0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7

每个表作为字节序列的长度和 CRC-32 校验值 (Check Value) (见附录 C) 如下:

Table    Length    CRC-32
----- ------ ------
Lut0 256 0x8e91efb7
Lut1 256 0xd01a32f4
Lut2 256 0x0dd7a0d6

给定 p1 是最后一个未压缩字节,p2 是倒数第二个未压缩字节,上下文 ID 可以如下计算:

For LSB6:    Context ID = p1 & 0x3f
For MSB6: Context ID = p1 >> 2
For UTF8: Context ID = Lut0[p1] | Lut1[p2]
For Signed: Context ID = (Lut2[p1] << 3) | Lut2[p2]

从上面定义的查找表和计算上下文 ID 的操作可以看出,字面量的上下文 ID 在 0..63 的范围内.

上下文模式 LSB6、MSB6、UTF8 和 Signed 分别用整数 0、1、2、3 表示.

为每个字面块类型定义上下文模式,并将它们存储在元块头部的连续位数组中,每个块类型始终为两位.

7.2. Context ID for Distances (距离的上下文 ID)

用于编码距离编码的上下文由对应于距离的复制长度 (Copy Length) 定义。对于复制长度 2、3、4 和大于 4,上下文 ID 分别为 0、1、2 和 3.

7.3. Encoding of the Context Map (上下文映射编码)

有两个上下文映射,一个用于字面量,一个用于距离。上下文映射的大小对于字面量为 64 * NBLTYPESL,对于距离为 4 * NBLTYPESD。上下文映射中的每个值都是 0 到 255 之间的整数,指示编码下一个字面量或距离时要使用的前缀编码的索引 (Index of Prefix Code).

上下文映射是二维矩阵 (Two-Dimensional Matrix),编码为一维数组 (One-Dimensional Array):

CMAPL[0..(64 * NBLTYPESL - 1)]
CMAPD[0..(4 * NBLTYPESD - 1)]

用块类型 BTYPE_x 和上下文 ID CIDx 编码字面量或距离编码的前缀编码的索引为:

index of literal prefix code = CMAPL[64 * BTYPE_L + CIDL]
index of distance prefix code = CMAPD[4 * BTYPE_D + CIDD]

上下文映射的值使用零值的游程编码 (Run Length Encoding, RLE) 和前缀编码 (Prefix Coding) 的组合进行编码。设 RLEMAX 表示游程编码的数量 (Number of Run Length Codes),NTREES 表示上下文映射中的最大值加一。NTREES 必须等于上下文映射中不同值的数量; 换句话说,上下文映射中的不同值必须是 [0..NTREES-1] 区间。前缀编码的字母表具有以下 RLEMAX + NTREES 个符号:

0: 值零
1: 重复零 2 到 3 次,读取 1 位表示重复长度
2: 重复零 4 到 7 次,读取 2 位表示重复长度
...
RLEMAX: 重复零 (1 << RLEMAX) 到 (1 << (RLEMAX+1))-1
次,读取 RLEMAX 位表示重复长度
RLEMAX + 1: 值 1
...
RLEMAX + NTREES - 1: 值 NTREES - 1

如果 RLEMAX = 0,则不使用游程编码,字母表的符号直接是上下文映射中的值。我们现在可以如下定义上下文映射的格式(字面量和距离上下文映射使用相同的格式):

  • 1..5 bits: RLEMAX, 0 用一个 0 位编码,值 1..16 用位模式 xxxx1 编码(所以 01001 是 5)

  • 前缀编码,字母表大小为 NTREES + RLEMAX

  • 上下文映射大小值,使用上述前缀编码和零值的游程编码进行编码。如果游程导致总长度超过上下文映射的大小,则应拒绝流为无效.

  • 1 bit: IMTF 位,如果设置,我们对上下文映射中的值执行逆移至前端变换 (Inverse Move-to-Front Transform, IMTF),以获取前缀编码索引.

请注意,RLEMAX 可能大于表示最长零值序列所需的值。此外,NTREES 值在上下文映射之前编码,如第 9.2 节所述.

我们通过以下 C 语言函数定义本规范中使用的逆移至前端变换:

void InverseMoveToFrontTransform(uint8_t* v, int v_len) {
uint8_t mtf[256];
int i;
for (i = 0; i < 256; ++i) {
mtf[i] = (uint8_t)i;
}
for (i = 0; i < v_len; ++i) {
uint8_t index = v[i];
uint8_t value = mtf[index];
v[i] = value;
for (; index; --index) {
mtf[index] = mtf[index - 1];
}
mtf[0] = value;
}
}

请注意,逆移至前端变换不会产生 [0..NTREES-1] 区间之外的值.

关键概念说明

四种上下文模式:

  • LSB6: 提取最低 6 位,适用于通用数据
  • MSB6: 提取最高 6 位,适用于大端数据
  • UTF8: 针对 UTF-8 文本优化,使用双字节上下文
  • Signed: 针对有符号整数序列优化

查找表验证:

  • 三个 256 字节查找表
  • 每个表都有 CRC-32 校验值用于验证完整性

上下文映射结构:

  • 字面量: 64 * NBLTYPESL 个条目
  • 距离: 4 * NBLTYPESD 个条目
  • 游程编码优化零值

IMTF 变换:

  • 用于减少上下文映射的熵
  • 标准移至前端算法的逆变换
  • 不产生范围外的值

来源 (Source): RFC 7932, Section 7
官方文本 (Official Text): https://www.rfc-editor.org/rfc/rfc7932.txt