Skip to main content

2. Compressed Representation Overview

2. Compressed Representation Overview

A compressed data set consists of a header and a series of meta-blocks. Each meta-block compresses a sequence of input bytes. The header contains the size of the uncompressed data, if known in advance, otherwise it contains zero. As such, it is legal to have a compressed stream with zero header bytes.

A meta-block consists of a meta-block header and meta-block data. The meta-block header describes how the data part should be interpreted. It contains the size of the data part, along with two bits to signal if the data is uncompressed or the last meta-block. The interpretation of the meta-block data depends on these bits.

The uncompressed size of a meta-block may be zero. For example, an empty meta-block may be sent after a previous meta-block of uncompressed type or to simply signal end of data.

There are three different kinds of meta-blocks:

Compressed meta-blocks: These meta-blocks contain LZ77 prefix-coded data. Each compressed meta-block is divided into a prefix code section and a data section. The prefix code section contains prefix codes to be used in the data section. The data section contains a sequence of LZ77 backward references, literals, and static dictionary references. Each LZ77 backward reference is a ⟨length, distance⟩ pair, where length denotes the number of bytes to be copied and distance denotes how many bytes backward in the uncompressed data stream to start copying. A distance of 1 means that the last byte is to be copied distance times. A distance can also be translated to a position in the static dictionary. Static dictionary references are specially handled backward references; see Section 8 for details. Literals and backward references are stored in the same data structure, called an insert-and-copy length code. In this way, one value determines the insert length, which is the number of literals immediately following the code, and the copy length, which is the length of the next backward reference. The distance of the backward reference is determined by the next distance code. The last insert-and-copy length code can have a copy length of 0; in this case, the last symbols in the data are literal values.

Uncompressed meta-blocks: These meta-blocks contain uncompressed data.

Meta-data meta-blocks: These meta-blocks are reserved for future use. They are skipped by the decoder. This document does not define the format of these meta-blocks. Applications can define such formats if needed.

The structure of the compressed meta-block is as follows:

  • A block type and block count for each of the three categories of literals, insert-and-copy lengths, and distances. The block type has a small integer value and identifies a set of prefix codes to be used for the next block count values. The block count is a number of symbols for which the prefix codes and context models in the current block type are used.

  • A prefix code for the block types for literals (BTYPE-L).

  • A prefix code for the block counts for literals (BLEN-L).

  • A prefix code for the block types for insert-and-copy lengths (BTYPE-I).

  • A prefix code for the block counts for insert-and-copy lengths (BLEN-I).

  • A prefix code for the block types for distances (BTYPE-D).

  • A prefix code for the block counts for distances (BLEN-D).

  • Possibly, a distance parameter NPOSTFIX and NDIRECT (if the meta-block is the first one, they are set to 0; otherwise they can be set to any non-negative values, see Section 4).

  • Number of literal block types NBLTYPES-L and number of context modes CMODE (if NBLTYPES-L is 0, it is interpreted as 1).

  • For each of the NBLTYPES-L literal block types:

    • A context mode (zero, one, or two) for the literals.
    • Possibly, a literal context map (if NBLTYPES-L is 1, the context map does not need to be encoded; in that case, the context ID is always 0).
  • Number of insert-and-copy block types NBLTYPES-I (if it is 0, it is interpreted as 1).

  • Possibly, an insert-and-copy context map (if NBLTYPES-I is 1, the context map does not need to be encoded; in that case, the context ID is always 0).

  • Number of distance block types NBLTYPES-D (if it is 0, it is interpreted as 1).

  • Possibly, a distance context map (if NBLTYPES-D is 1, the context map does not need to be encoded; in that case, the context ID is always 0).

  • A sequence of prefix codes for literals, the number of which is equal to NBLTYPES-L multiplied by the number of different context IDs (at most 64 for each literal block type). These codes are used for the literals in the next commands.

  • A sequence of prefix codes for insert-and-copy lengths, the number of which is equal to NBLTYPES-I multiplied by the number of different context IDs (at most 256 for each insert-and-copy block type). These codes are used for the next commands.

  • A sequence of prefix codes for distances, the number of which is equal to NBLTYPES-D multiplied by the number of different context IDs (4 for each distance block type). These codes are used for the distances in the next commands.

  • Data stream with block-switch commands, literals, insert-and-copy lengths, and distances.

The block structure helps compression by partitioning the input into separate alphabet distributions with the block types. Block types for literals and distances may be even further partitioned into different distributions using contexts. The per-block-type context maps control which prefix code is used with different contexts.

The intuition behind block switching is that different regions in the input are likely to have different statistics (e.g., HTML versus JavaScript in a web page). There can be more than one block type in a particular meta-block for each category (literals, insert-and-copy lengths, and distances), so block switching can be used to adapt prefix code distributions without starting a new meta-block. Each block type can have multiple prefix code distributions for literals and distances based on context modeling.


Source: RFC 7932, Section 2 Official Text: https://www.rfc-editor.org/rfc/rfc7932.txt