Skip to main content

8. Static Dictionary (静态字典)

官方英文原文 (Official English Text)

At any given point during decoding the compressed data, a reference to a duplicated string in the uncompressed data produced so far has a maximum backward distance value, which is the minimum of the window size and the number of uncompressed bytes produced. However, decoding a distance from the compressed stream, as described in Section 4, can produce distances that are greater than this maximum allowed value. In this case, the distance is treated as a reference to a word in the static dictionary given in Appendix A. The copy length for a static dictionary reference must be between 4 and 24. The static dictionary has three parts:

  • DICT[0..DICTSIZE], an array of bytes
  • DOFFSET[0..24], an array of byte-offset values for each length
  • NDBITS[0..24], an array of bit-depth values for each length

The number of static dictionary words for a given length is:

NWORDS[length] = 0                       (if length < 4)
NWORDS[length] = (1 << NDBITS[length]) (if length >= 4)

DOFFSET and DICTSIZE are defined by the following recursion:

DOFFSET[0] = 0
DOFFSET[length + 1] = DOFFSET[length] + length * NWORDS[length]
DICTSIZE = DOFFSET[24] + 24 * NWORDS[24]

The offset of a word within the DICT array for a given length and index is:

offset(length, index) = DOFFSET[length] + index * length

Each static dictionary word has 121 different forms, given by applying a word transformation to a base word in the DICT array. The list of word transformations is given in Appendix B. The static dictionary word for a <length, distance> pair can be reconstructed as follows:

word_id = distance - (max allowed distance + 1)
index = word_id % NWORDS[length]
base_word = DICT[offset(length, index)..offset(length, index+1)-1]
transform_id = word_id >> NDBITS[length]

The string copied to the uncompressed stream is computed by applying the transformation to the base dictionary word. If transform_id is greater than 120, or the length is smaller than 4 or greater than 24, then the compressed stream should be rejected as invalid.

Each word transformation has the following form:

transform_i(word) = prefix_i + T_i(word) + suffix_i

where the _i subscript denotes the transform_id above. Each T_i is one of the following 21 elementary transforms:

Identity, FermentFirst, FermentAll,
OmitFirst1, ..., OmitFirst9, OmitLast1, ..., OmitLast9

The form of these elementary transforms is as follows:

Identity(word) = word

FermentFirst(word) = see below

FermentAll(word) = see below

OmitFirstk(word) = the last (length(word) - k) bytes of word, or
empty string if length(word) < k

OmitLastk(word) = the first (length(word) - k) bytes of word, or
empty string if length(word) < k

We define the FermentFirst and FermentAll transforms used in this specification by the following C language functions:

int Ferment(uint8_t* word, int word_len, int pos) {
if (word[pos] < 192) {
if (word[pos] >= 97 and word[pos] <= 122) {
word[pos] = word[pos] ^ 32;
}
return 1;
} else if (word[pos] < 224) {
if (pos + 1 < word_len) {
word[pos + 1] = word[pos + 1] ^ 32;
}
return 2;
} else {
if (pos + 2 < word_len) {
word[pos + 2] = word[pos + 2] ^ 5;
}
return 3;
}
}

void FermentFirst(uint8_t* word, int word_len) {
if (word_len > 0) {
Ferment(word, word_len, 0);
}
}

void FermentAll(uint8_t* word, int word_len) {
int i = 0;
while (i < word_len) {
i += Ferment(word, word_len, i);
}
}

Appendix B contains the list of transformations by specifying the prefix, elementary transform and suffix components of each of them. Note that the OmitFirst8 elementary transform is not used in the list of transformations. The strings in Appendix B are in C-string format with respect to escape (backslash) characters.

The maximum number of additional bytes that a transform may add to a base word is 13. Since the largest base word is 24 bytes long, a buffer of 38 bytes is sufficient to store any transformed words (counting a terminating zero byte).


中文翻译 (Chinese Translation)

在解码压缩数据期间的任何给定点,对到目前为止生成的未压缩数据中重复字符串的引用具有最大后向距离值 (Maximum Backward Distance Value),该值是窗口大小 (Window Size) 和已生成的未压缩字节数 (Number of Uncompressed Bytes Produced) 的最小值。但是,如第 4 节所述,从压缩流中解码距离可能会产生大于此最大允许值的距离。在这种情况下,距离被视为对附录 A 中给出的静态字典 (Static Dictionary) 中词 (Word) 的引用。静态字典引用的复制长度 (Copy Length) 必须在 4 到 24 之间。静态字典有三个部分:

  • DICT[0..DICTSIZE]: 字节数组 (Array of Bytes)
  • DOFFSET[0..24]: 每个长度的字节偏移值 (Byte-Offset Value) 数组
  • NDBITS[0..24]: 每个长度的位深度值 (Bit-Depth Value) 数组

给定长度的静态字典词数 (Number of Static Dictionary Words) 为:

NWORDS[length] = 0                       (if length < 4)
NWORDS[length] = (1 << NDBITS[length]) (if length >= 4)

DOFFSETDICTSIZE 由以下递归 (Recursion) 定义:

DOFFSET[0] = 0
DOFFSET[length + 1] = DOFFSET[length] + length * NWORDS[length]
DICTSIZE = DOFFSET[24] + 24 * NWORDS[24]

给定长度和索引的词在 DICT 数组中的偏移量 (Offset) 为:

offset(length, index) = DOFFSET[length] + index * length

每个静态字典词都有 121 种不同的形式 (Different Forms),通过对 DICT 数组中的基本词 (Base Word) 应用词变换 (Word Transformation) 来生成。词变换列表在附录 B 中给出。<length, distance> 对的静态字典词可以如下重建:

word_id = distance - (max allowed distance + 1)
index = word_id % NWORDS[length]
base_word = DICT[offset(length, index)..offset(length, index+1)-1]
transform_id = word_id >> NDBITS[length]

复制到未压缩流的字符串通过对基本字典词应用变换 (Applying Transformation) 来计算。如果 transform_id 大于 120,或长度小于 4 或大于 24,则应拒绝压缩流为无效 (Invalid).

每个词变换具有以下形式:

transform_i(word) = prefix_i + T_i(word) + suffix_i

其中 _i 下标表示上面的 transform_id。每个 T_i 是以下 21 个基本变换 (Elementary Transform) 之一:

Identity, FermentFirst, FermentAll,
OmitFirst1, ..., OmitFirst9, OmitLast1, ..., OmitLast9

这些基本变换的形式如下:

Identity(word) = word

FermentFirst(word) = 见下文

FermentAll(word) = 见下文

OmitFirstk(word) = word 的最后 (length(word) - k) 个字节,
如果 length(word) < k 则为空字符串

OmitLastk(word) = word 的前 (length(word) - k) 个字节,
如果 length(word) < k 则为空字符串

我们通过以下 C 语言函数定义本规范中使用的 FermentFirstFermentAll 变换:

int Ferment(uint8_t* word, int word_len, int pos) {
if (word[pos] < 192) {
if (word[pos] >= 97 and word[pos] <= 122) {
word[pos] = word[pos] ^ 32;
}
return 1;
} else if (word[pos] < 224) {
if (pos + 1 < word_len) {
word[pos + 1] = word[pos + 1] ^ 32;
}
return 2;
} else {
if (pos + 2 < word_len) {
word[pos + 2] = word[pos + 2] ^ 5;
}
return 3;
}
}

void FermentFirst(uint8_t* word, int word_len) {
if (word_len > 0) {
Ferment(word, word_len, 0);
}
}

void FermentAll(uint8_t* word, int word_len) {
int i = 0;
while (i < word_len) {
i += Ferment(word, word_len, i);
}
}

附录 B 包含通过指定每个变换的前缀 (Prefix)、基本变换 (Elementary Transform) 和后缀 (Suffix) 组件来列出的变换列表。请注意,OmitFirst8 基本变换未在变换列表中使用。附录 B 中的字符串采用 C 字符串格式,关于转义(反斜杠)字符.

变换可能添加到基本词的最大附加字节数 (Maximum Number of Additional Bytes) 为 13。由于最大的基本词长度为 24 字节,因此 38 字节的缓冲区 (Buffer) 足以存储任何变换的词(包括终止零字节 (Terminating Zero Byte)).

关键概念说明

静态字典结构:

  • 预定义的常用词汇表
  • 长度范围: 4-24 字节
  • 三个核心数组: DICT, DOFFSET, NDBITS

词重建过程:

  1. 计算 word_id (基于距离)
  2. 确定 index (词索引)
  3. 提取 base_word (基本词)
  4. 计算 transform_id (变换 ID)
  5. 应用变换得到最终词

121 种变换:

  • 121 = 组合数量
  • 每个基本词可以变换为 121 种形式
  • 由前缀 + 变换 + 后缀组成

21 种基本变换:

  • Identity: 恒等变换
  • FermentFirst: 发酵首字符(大小写转换)
  • FermentAll: 发酵所有字符
  • OmitFirst1-9: 省略前 1-9 个字节
  • OmitLast1-9: 省略后 1-9 个字节

Ferment 变换逻辑:

  • 处理 UTF-8 编码
  • < 192: 单字节字符,XOR 32 (大小写切换)
  • < 224: 双字节字符,第二字节 XOR 32
  • >= 224: 三字节字符,第三字节 XOR 5

缓冲区要求:

  • 最大基本词: 24 字节
  • 最大附加: 13 字节
  • 总计需要: 38 字节 (含终止符)

来源 (Source): RFC 7932, Section 8
官方文本 (Official Text): https://www.rfc-editor.org/rfc/rfc7932.txt