Skip to main content

1. Introduction (简介)

官方英文原文 (Official English Text)

1.1. Purpose

The purpose of this specification is to define a lossless compressed data format that:

  • is independent of CPU type, operating system, file system, and character set; hence, it can be used for interchange.

  • can be produced or consumed, even for an arbitrarily long, sequentially presented input data stream, using only an a priori bounded amount of intermediate storage; hence, it can be used in data communications or similar structures, such as Unix filters.

  • compresses data with a compression ratio comparable to the best currently available general-purpose compression methods, in particular, considerably better than the gzip program.

  • decompresses much faster than current LZMA implementations.

The data format defined by this specification does not attempt to:

  • allow random access to compressed data.

  • compress specialized data (e.g., raster graphics) as densely as the best currently available specialized algorithms.

This document is the authoritative specification of the brotli compressed data format. It defines the set of valid brotli compressed data streams and a decoder algorithm that produces the uncompressed data stream from a valid brotli compressed data stream.

1.2. Intended Audience

This specification is intended for use by software implementers to compress data into and/or decompress data from the brotli format.

The text of the specification assumes a basic background in programming at the level of bits and other primitive data representations. Familiarity with the technique of Huffman coding is helpful but not required.

This specification uses (heavily) the notations and terminology introduced in the DEFLATE format specification [RFC1951]. For the sake of completeness, we always include the whole text of the relevant parts of RFC 1951; therefore, familiarity with the DEFLATE format is helpful but not required.

The compressed data format defined in this specification is an integral part of the WOFF File Format 2.0 [WOFF2]; therefore, this specification is also intended for implementers of WOFF 2.0 compressors and decompressors.

1.3. Scope

This document specifies a method for representing a sequence of bytes as a (usually shorter) sequence of bits and a method for packing the latter bit sequence into bytes.

1.4. Compliance

Unless otherwise indicated below, a compliant decompressor must be able to accept and decompress any data set that conforms to all the specifications presented here. A compliant compressor must produce data sets that conform to all the specifications presented here.

1.5. Definitions of Terms and Conventions Used

Byte: 8 bits stored or transmitted as a unit (same as an octet). For this specification, a byte is exactly 8 bits, even on machines that store a character on a number of bits different from eight. See below for the numbering of bits within a byte.

String: a sequence of arbitrary bytes.

Bytes stored within a computer do not have a "bit order", since they are always treated as a unit. However, a byte considered as an integer between 0 and 255 does have a most and least significant bit (lsb), and since we write numbers with the most significant digit on the left, we also write bytes with the most significant bit (msb) on the left. In the diagrams below, we number the bits of a byte so that bit 0 is the least significant bit, i.e., the bits are numbered:

+--------+
|76543210|
+--------+

Within a computer, a number may occupy multiple bytes. All multi-byte numbers in the format described here are stored with the least-significant byte first (at the lower memory address). For example, the decimal number 520 is stored as:

    0        1
+--------+--------+
|00001000|00000010|
+--------+--------+
^ ^
| |
| + more significant byte = 2 x 256
+ less significant byte = 8

1.5.1. Packing into Bytes

This document does not address the issue of the order in which bits of a byte are transmitted on a bit-sequential medium, since the final data format described here is byte rather than bit oriented. However, we describe the compressed block format in below as a sequence of data elements of various bit lengths, not a sequence of bytes. We must therefore specify how to pack these data elements into bytes to form the final compressed byte sequence:

  • Data elements are packed into bytes in order of increasing bit number within the byte, i.e., starting with the least-significant bit of the byte.

  • Data elements other than Huffman codes are packed starting with the least-significant bit of the data element.

  • Huffman codes are packed starting with the most-significant bit of the code.

In other words, if one were to print out the compressed data as a sequence of bytes, starting with the first byte at the right margin and proceeding to the left, with the most-significant bit of each byte on the left as usual, one would be able to parse the result from right to left, with fixed-width elements in the correct MSB-to-LSB order and Huffman codes in bit-reversed order (i.e., with the first bit of the code in the relative LSB position).

As an example, consider packing the following data elements into a sequence of 3 bytes:

  • 2-bit value 1 (binary 01)
  • 3-bit value 2 (binary 010)
  • 5-bit value 6 (binary 00110)
  • 4-bit code with bit pattern 1011 (with the first bit being 1, the second bit being 0, the third bit being 1, and the fourth bit being 1)
  • 2-bit value 0 (binary 00)
  • 4-bit code with bit pattern 0011
  • 5-bit value 31 (binary 11111)

The following diagram shows how these elements are packed:

       Byte 0           Byte 1           Byte 2
+---------------+----------------+----------------+
|0|1|1|0|0|0|1|0|1|1|0|1|1|0|0|0|0|1|1|1|1|1|1|1|
+---------------+----------------+----------------+
^ ^ ^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | | | |
| | | | | | | | +-------------|---+ 5-bit value 31
| | | | | | | +-----------------|---|-+ 4-bit code 0011
| | | | | | +-----------------------+ |
| | | | | | | | 2-bit value 0
| | | | | +-----------------------------+-----+ 4-bit code 1011
| | | +---+ 5-bit value 6
| | +-----+ 3-bit value 2
| +---+ 2-bit value 1

中文翻译 (Chinese Translation)

1.1. Purpose (目的)

本规范的目的是定义一种无损压缩数据格式 (Lossless Compressed Data Format),该格式具有以下特性:

  • 独立于 CPU 类型、操作系统、文件系统和字符集; 因此可用于数据交换.

  • 即使对于任意长的、顺序呈现的输入数据流,也可以仅使用先验有界的中间存储量来生成或消费数据; 因此可用于数据通信或类似的结构,如 Unix 过滤器.

  • 压缩数据的压缩比 (Compression Ratio) 可与当前最佳的通用压缩方法相媲美,特别是明显优于 gzip 程序.

  • 解压缩速度远快于当前的 LZMA 实现.

本规范定义的数据格式不尝试:

  • 允许对压缩数据进行随机访问 (Random Access).

  • 像当前最佳专用算法那样密集地压缩专用数据(例如,光栅图形 (Raster Graphics)).

本文档是 brotli 压缩数据格式的权威规范。它定义了有效 brotli 压缩数据流 (Valid Brotli Compressed Data Stream) 的集合,以及从有效 brotli 压缩数据流产生未压缩数据流的解码器算法 (Decoder Algorithm).

1.2. Intended Audience (目标受众)

本规范面向软件实现者,用于将数据压缩到 brotli 格式和/或从 brotli 格式解压缩数据.

规范文本假设读者具有位 (Bit) 和其他原始数据表示 (Primitive Data Representation) 级别的基本编程背景。熟悉 Huffman 编码 (Huffman Coding) 技术会有所帮助,但不是必需的.

本规范大量使用 DEFLATE 格式规范 [RFC1951] 中引入的符号和术语。为了完整性,我们始终包含 RFC 1951 相关部分的完整文本; 因此,熟悉 DEFLATE 格式会有所帮助,但不是必需的.

本规范中定义的压缩数据格式是 WOFF File Format 2.0 [WOFF2] 的组成部分; 因此,本规范也面向 WOFF 2.0 压缩器和解压缩器的实现者.

1.3. Scope (范围)

本文档指定了一种将字节序列 (Sequence of Bytes) 表示为(通常更短的)位序列 (Sequence of Bits) 的方法,以及一种将后者的位序列打包到字节中的方法.

1.4. Compliance (合规性)

除非下文另有说明,符合标准的解压缩器 (Compliant Decompressor) 必须能够接受并解压缩符合此处所有规范的任何数据集。符合标准的压缩器 (Compliant Compressor) 必须生成符合此处所有规范的数据集.

1.5. Definitions of Terms and Conventions Used (术语和约定定义)

Byte (字节): 作为一个单元存储或传输的 8 位(与八位字节 (Octet) 相同)。对于本规范,一个字节恰好是 8 位,即使在以不同于 8 位的位数存储字符的机器上也是如此。有关字节内位的编号,请参见下文.

String (字符串): 任意字节序列.

存储在计算机中的字节没有"位序" (Bit Order),因为它们始终被视为一个单元。但是,作为 0 到 255 之间的整数的字节确实有最高有效位 (Most Significant Bit, MSB) 和最低有效位 (Least Significant Bit, LSB),并且由于我们用最高有效数字在左侧写数字,我们也用最高有效位在左侧写字节。在下面的图表中,我们对字节的位进行编号,使得位 0 是最低有效位,即,位的编号为:

+--------+
|76543210|
+--------+

在计算机中,一个数字可能占用多个字节。此处描述的格式中的所有多字节数字 (Multi-byte Number) 都以最低有效字节在前(在较低的内存地址)存储。例如,十进制数 520 存储为:

    0        1
+--------+--------+
|00001000|00000010|
+--------+--------+
^ ^
| |
| + 更高有效字节 = 2 x 256
+ 更低有效字节 = 8

1.5.1. Packing into Bytes (字节打包)

本文档不涉及在位串行介质 (Bit-Sequential Medium) 上传输字节的位的顺序问题,因为此处描述的最终数据格式是面向字节而非面向位的。但是,我们将下面的压缩块格式 (Compressed Block Format) 描述为各种位长度的数据元素序列,而不是字节序列。因此,我们必须指定如何将这些数据元素打包到字节中以形成最终的压缩字节序列:

  • 数据元素按字节内位号递增的顺序打包到字节中,即从字节的最低有效位开始.

  • 除 Huffman 编码 (Huffman Code) 之外的数据元素从数据元素的最低有效位开始打包.

  • Huffman 编码从编码的最高有效位开始打包.

换句话说,如果将压缩数据打印为字节序列,从第一个字节在右边边距开始,向左边进行,每个字节的最高有效位像往常一样在左边,则可以从右到左解析结果,固定宽度元素以正确的 MSB 到 LSB 顺序,Huffman 编码以位反转顺序(即,编码的第一位在相对的 LSB 位置).

作为示例,考虑将以下数据元素打包到 3 字节序列中:

  • 2 位值 1 (二进制 01)
  • 3 位值 2 (二进制 010)
  • 5 位值 6 (二进制 00110)
  • 4 位编码,位模式 1011 (第一位为 1,第二位为 0,第三位为 1,第四位为 1)
  • 2 位值 0 (二进制 00)
  • 4 位编码,位模式 0011
  • 5 位值 31 (二进制 11111)

下图显示了这些元素是如何打包的:

       Byte 0           Byte 1           Byte 2
+---------------+----------------+----------------+
|0|1|1|0|0|0|1|0|1|1|0|1|1|0|0|0|0|1|1|1|1|1|1|1|
+---------------+----------------+----------------+
^ ^ ^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | | | |
| | | | | | | | +-------------|---+ 5-bit value 31
| | | | | | | +-----------------|---|-+ 4-bit code 0011
| | | | | | +-----------------------+ |
| | | | | | | | 2-bit value 0
| | | | | +-----------------------------+-----+ 4-bit code 1011
| | | +---+ 5-bit value 6
| | +-----+ 3-bit value 2
| +---+ 2-bit value 1

来源 (Source): RFC 7932, Section 1
官方文本 (Official Text): https://www.rfc-editor.org/rfc/rfc7932.txt