Skip to main content

2. Compressed Representation Overview (压缩表示概述)

官方英文原文 (Official English Text)

A compressed data set consists of a header and a series of meta-blocks. Each meta-block compresses a sequence of input bytes. The header contains the size of the uncompressed data, if known in advance, otherwise it contains zero. As such, it is legal to have a compressed stream with zero header bytes.

A meta-block consists of a meta-block header and meta-block data. The meta-block header describes how the data part should be interpreted. It contains the size of the data part, along with two bits to signal if the data is uncompressed or the last meta-block. The interpretation of the meta-block data depends on these bits.

The uncompressed size of a meta-block may be zero. For example, an empty meta-block may be sent after a previous meta-block of uncompressed type or to simply signal end of data.

There are three different kinds of meta-blocks:

Compressed meta-blocks: These meta-blocks contain LZ77 prefix-coded data. Each compressed meta-block is divided into a prefix code section and a data section. The prefix code section contains prefix codes to be used in the data section. The data section contains a sequence of LZ77 backward references, literals, and static dictionary references. Each LZ77 backward reference is a ⟨length, distance⟩ pair, where length denotes the number of bytes to be copied and distance denotes how many bytes backward in the uncompressed data stream to start copying. A distance of 1 means that the last byte is to be copied distance times. A distance can also be translated to a position in the static dictionary. Static dictionary references are specially handled backward references; see Section 8 for details. Literals and backward references are stored in the same data structure, called an insert-and-copy length code. In this way, one value determines the insert length, which is the number of literals immediately following the code, and the copy length, which is the length of the next backward reference. The distance of the backward reference is determined by the next distance code. The last insert-and-copy length code can have a copy length of 0; in this case, the last symbols in the data are literal values.

Uncompressed meta-blocks: These meta-blocks contain uncompressed data.

Meta-data meta-blocks: These meta-blocks are reserved for future use. They are skipped by the decoder. This document does not define the format of these meta-blocks. Applications can define such formats if needed.

The structure of the compressed meta-block is as follows:

  • A block type and block count for each of the three categories of literals, insert-and-copy lengths, and distances. The block type has a small integer value and identifies a set of prefix codes to be used for the next block count values. The block count is a number of symbols for which the prefix codes and context models in the current block type are used.

  • A prefix code for the block types for literals (BTYPE-L).

  • A prefix code for the block counts for literals (BLEN-L).

  • A prefix code for the block types for insert-and-copy lengths (BTYPE-I).

  • A prefix code for the block counts for insert-and-copy lengths (BLEN-I).

  • A prefix code for the block types for distances (BTYPE-D).

  • A prefix code for the block counts for distances (BLEN-D).

  • Possibly, a distance parameter NPOSTFIX and NDIRECT (if the meta-block is the first one, they are set to 0; otherwise they can be set to any non-negative values, see Section 4).

  • Number of literal block types NBLTYPES-L and number of context modes CMODE (if NBLTYPES-L is 0, it is interpreted as 1).

  • For each of the NBLTYPES-L literal block types:

    • A context mode (zero, one, or two) for the literals.
    • Possibly, a literal context map (if NBLTYPES-L is 1, the context map does not need to be encoded; in that case, the context ID is always 0).
  • Number of insert-and-copy block types NBLTYPES-I (if it is 0, it is interpreted as 1).

  • Possibly, an insert-and-copy context map (if NBLTYPES-I is 1, the context map does not need to be encoded; in that case, the context ID is always 0).

  • Number of distance block types NBLTYPES-D (if it is 0, it is interpreted as 1).

  • Possibly, a distance context map (if NBLTYPES-D is 1, the context map does not need to be encoded; in that case, the context ID is always 0).

  • A sequence of prefix codes for literals, the number of which is equal to NBLTYPES-L multiplied by the number of different context IDs (at most 64 for each literal block type). These codes are used for the literals in the next commands.

  • A sequence of prefix codes for insert-and-copy lengths, the number of which is equal to NBLTYPES-I multiplied by the number of different context IDs (at most 256 for each insert-and-copy block type). These codes are used for the next commands.

  • A sequence of prefix codes for distances, the number of which is equal to NBLTYPES-D multiplied by the number of different context IDs (4 for each distance block type). These codes are used for the distances in the next commands.

  • Data stream with block-switch commands, literals, insert-and-copy lengths, and distances.

The block structure helps compression by partitioning the input into separate alphabet distributions with the block types. Block types for literals and distances may be even further partitioned into different distributions using contexts. The per-block-type context maps control which prefix code is used with different contexts.

The intuition behind block switching is that different regions in the input are likely to have different statistics (e.g., HTML versus JavaScript in a web page). There can be more than one block type in a particular meta-block for each category (literals, insert-and-copy lengths, and distances), so block switching can be used to adapt prefix code distributions without starting a new meta-block. Each block type can have multiple prefix code distributions for literals and distances based on context modeling.


中文翻译 (Chinese Translation)

压缩数据集 (Compressed Data Set) 由一个头部 (Header) 和一系列元块 (Meta-Block) 组成。每个元块压缩一个输入字节序列。头部包含未压缩数据的大小(如果预先已知),否则包含零。因此,拥有零头部字节的压缩流是合法的.

元块由元块头部 (Meta-Block Header) 和元块数据 (Meta-Block Data) 组成。元块头部描述了数据部分应该如何解释。它包含数据部分的大小,以及两个位来表示数据是否未压缩或是最后一个元块。元块数据的解释取决于这些位.

元块的未压缩大小可以为零。例如,可以在未压缩类型的前一个元块之后发送空元块,或者仅仅表示数据结束.

有三种不同类型的元块:

压缩元块 (Compressed Meta-Blocks): 这些元块包含 LZ77 前缀编码数据。每个压缩元块分为前缀编码部分 (Prefix Code Section) 和数据部分 (Data Section)。前缀编码部分包含要在数据部分中使用的前缀编码。数据部分包含 LZ77 后向引用 (Backward Reference)、字面量 (Literal) 和静态字典引用 (Static Dictionary Reference) 的序列。每个 LZ77 后向引用是一个 ⟨length, distance⟩ 对,其中 length 表示要复制的字节数,distance 表示在未压缩数据流中向后多少字节开始复制。距离为 1 意味着最后一个字节将被复制 distance 次。距离也可以转换为静态字典中的位置。静态字典引用是特殊处理的后向引用; 详见第 8 节。字面量和后向引用存储在同一数据结构中,称为插入和复制长度编码 (Insert-and-Copy Length Code)。通过这种方式,一个值确定插入长度 (Insert Length),即紧随编码之后的字面量数量,以及复制长度 (Copy Length),即下一个后向引用的长度。后向引用的距离由下一个距离编码确定。最后一个插入和复制长度编码可以具有为 0 的复制长度; 在这种情况下,数据中的最后符号是字面值.

未压缩元块 (Uncompressed Meta-Blocks): 这些元块包含未压缩数据.

元数据元块 (Meta-Data Meta-Blocks): 这些元块保留供将来使用。它们被解码器跳过。本文档未定义这些元块的格式。应用程序可以根据需要定义此类格式.

压缩元块的结构如下:

  • 字面量 (Literals)、插入和复制长度 (Insert-and-Copy Lengths) 和距离 (Distances) 这三个类别中每个类别的块类型 (Block Type) 和块计数 (Block Count)。块类型具有一个小整数值,并标识用于下一个块计数值的前缀编码集。块计数是当前块类型中使用前缀编码和上下文模型的符号数量.

  • 字面量的块类型的前缀编码 (BTYPE-L).

  • 字面量的块计数的前缀编码 (BLEN-L).

  • 插入和复制长度的块类型的前缀编码 (BTYPE-I).

  • 插入和复制长度的块计数的前缀编码 (BLEN-I).

  • 距离的块类型的前缀编码 (BTYPE-D).

  • 距离的块计数的前缀编码 (BLEN-D).

  • 可能的距离参数 NPOSTFIXNDIRECT (如果元块是第一个,它们设置为 0; 否则可以设置为任何非负值,见第 4 节).

  • 字面量块类型数量 NBLTYPES-L 和上下文模式数量 CMODE (如果 NBLTYPES-L 为 0,则解释为 1).

  • 对于每个 NBLTYPES-L 字面量块类型:

    • 字面量的上下文模式 (Context Mode) (零、一或二).
    • 可能的字面量上下文映射 (Literal Context Map) (如果 NBLTYPES-L 为 1,则不需要编码上下文映射; 在这种情况下,上下文 ID 始终为 0).
  • 插入和复制块类型数量 NBLTYPES-I (如果为 0,则解释为 1).

  • 可能的插入和复制上下文映射 (如果 NBLTYPES-I 为 1,则不需要编码上下文映射; 在这种情况下,上下文 ID 始终为 0).

  • 距离块类型数量 NBLTYPES-D (如果为 0,则解释为 1).

  • 可能的距离上下文映射 (如果 NBLTYPES-D 为 1,则不需要编码上下文映射; 在这种情况下,上下文 ID 始终为 0).

  • 字面量的前缀编码序列,其数量等于 NBLTYPES-L 乘以不同上下文 ID 的数量(每个字面量块类型最多 64 个)。这些编码用于下一个命令中的字面量.

  • 插入和复制长度的前缀编码序列,其数量等于 NBLTYPES-I 乘以不同上下文 ID 的数量(每个插入和复制块类型最多 256 个)。这些编码用于下一个命令.

  • 距离的前缀编码序列,其数量等于 NBLTYPES-D 乘以不同上下文 ID 的数量(每个距离块类型 4 个)。这些编码用于下一个命令中的距离.

  • 包含块切换命令 (Block-Switch Command)、字面量、插入和复制长度以及距离的数据流.

块结构 (Block Structure) 通过使用块类型将输入划分为独立的字母表分布来帮助压缩。字面量和距离的块类型可以使用上下文进一步划分为不同的分布。每个块类型的上下文映射 (Per-Block-Type Context Map) 控制在不同上下文下使用哪个前缀编码.

块切换背后的直觉是输入中的不同区域可能具有不同的统计特性(例如,网页中的 HTML 与 JavaScript)。对于每个类别(字面量、插入和复制长度以及距离),特定元块中可以有多个块类型,因此块切换可用于调整前缀编码分布而无需启动新的元块。每个块类型可以基于上下文建模为字面量和距离提供多个前缀编码分布.


来源 (Source): RFC 7932, Section 2
官方文本 (Official Text): https://www.rfc-editor.org/rfc/rfc7932.txt