11. Considerations for Compressor Implementations (压缩器实现的考虑因素)
官方英文原文 (Official English Text)
Since the intent of this document is to define the brotli compressed data format without reference to any particular compression algorithm, the material in this section is not part of the definition of the format, and a compressor need not follow it in order to be compliant.
11.1. Trivial Compressor
In this section, we present a very simple algorithm that produces a valid brotli stream representing an arbitrary sequence of uncompressed bytes in the form of the following C++ language function.
string BrotliCompressTrivial(const string& u) {
if (u.empty()) {
return string(1, 6);
}
int i;
string c;
c.append(1, 12);
for (i = 0; i + 65535 < u.size(); i += 65536) {
c.append(1, 248);
c.append(1, 255);
c.append(1, 15);
c.append(&u[i], 65536);
}
if (i < u.size()) {
int r = u.size() - i - 1;
c.append(1, (r & 31) << 3);
c.append(1, r >> 5);
c.append(1, 8 + (r >> 13));
c.append(&u[i], r + 1);
}
c.append(1, 3);
return c;
}
Note that this simple algorithm does not actually compress data, that is, the brotli representation will always be bigger than the original, but it shows that every sequence of N uncompressed bytes can be represented with a valid brotli stream that is not longer than N + (3 * (N >> 16) + 5) bytes.
11.2. Aligning Compressed Meta-Blocks to Byte Boundaries
As described in Section 9, only those meta-blocks that immediately follow an uncompressed meta-block or a metadata meta-block are guaranteed to start on a byte boundary. In some applications, it might be required that every non-metadata meta-block starts on a byte boundary. This can be achieved by appending an empty metadata meta-block after every non-metadata meta-block that does not end on a byte boundary.
11.3. Creating Self-Contained Parts within the Compressed Data
In some encoder implementations, it might be required to make a sequence of bytes within a brotli stream self-contained, that is, such that they can be decompressed independently from previous parts of the compressed data. This is a useful feature for three reasons. First, if a large compressed file is damaged, it is possible to recover some of the file after the damage. Second, it is useful when doing differential transfer of compressed data. If a sequence of uncompressed bytes is unchanged and compressed independently from previous data, then the compressed representation may also be unchanged and can therefore be transferred very cheaply. Third, if sequences of uncompressed bytes are compressed independently, it allows for parallel compression of these byte sequences within the same file, in addition to parallel compression of multiple files.
Given two sequences of uncompressed bytes, U0 and U1, we will now describe how to create two sequences of compressed bytes, C0 and C1, such that the concatenation of C0 and C1 is a valid brotli stream, and that C0 and C1 (together with the first byte of C0 that contains the window size) can be decompressed independently from each other to U0 and U1.
When compressing the byte sequence U0 to produce C0, we can use any compressor that works on the complete set of uncompressed bytes U0, with the following two changes. First, the ISLAST bit of the last meta-block of C0 must not be set. Second, C0 must end at a byte-boundary, which can be ensured by appending an empty metadata meta-block to it, as in Section 11.2.
When compressing the byte sequence U1 to produce C1, we can use any compressor that starts a new meta-block at the beginning of U1 within the U0+U1 input stream, with the following two changes. First, backward distances in C1 must not refer to static dictionary words or uncompressed bytes in U0. Even if a sequence of bytes in U1 would match a static dictionary word, or a sequence of bytes that overlaps U0, the compressor must represent this sequence of bytes with a combination of literal insertions and backward references to bytes in U1 instead. Second, the ring buffer of last four distances must be replenished first with distances in C1 before using it to encode other distances in C1. Note that both compressors producing C0 and C1 have to use the same window size, but the stream header is emitted only by the compressor that produces C0.
Note that this method can be easily generalized to more than two sequences of uncompressed bytes.
中文翻译 (Chinese Translation)
由于本文档的目的是在不参考任何特定压缩算法 (Compression Algorithm) 的情况下定义 brotli 压缩数据格式 (Compressed Data Format),因此本节中的材料不是格式定义的一部分,压缩器 (Compressor) 无需遵循它即可符合标准.
11.1. Trivial Compressor (简单压缩器)
在本节中,我们提供一个非常简单的算法,该算法以以下 C++ 语言函数的形式生成表示任意未压缩字节序列 (Uncompressed Byte Sequence) 的有效 brotli 流 (Brotli Stream).
string BrotliCompressTrivial(const string& u) {
if (u.empty()) {
return string(1, 6);
}
int i;
string c;
c.append(1, 12);
for (i = 0; i + 65535 < u.size(); i += 65536) {
c.append(1, 248);
c.append(1, 255);
c.append(1, 15);
c.append(&u[i], 65536);
}
if (i < u.size()) {
int r = u.size() - i - 1;
c.append(1, (r & 31) << 3);
c.append(1, r >> 5);
c.append(1, 8 + (r >> 13));
c.append(&u[i], r + 1);
}
c.append(1, 3);
return c;
}
请注意,这个简单的算法实际上并不压缩数据 (Compress Data),也就是说,brotli 表示总是比原始数据更大,但它表明每个 N 个未压缩字节的序列都可以用不超过 N + (3 * (N >> 16) + 5) 字节的有效 brotli 流来表示.
11.2. Aligning Compressed Meta-Blocks to Byte Boundaries (将压缩元块对齐到字节边界)
如第 9 节所述,只有紧随未压缩元块 (Uncompressed Meta-Block) 或元数据元块 (Metadata Meta-Block) 之后的元块才保证从字节边界 (Byte Boundary) 开始。在某些应用中,可能需要每个非元数据元块都从字节边界开始。这可以通过在每个未在字节边界上结束的非元数据元块之后附加一个空的元数据元块来实现.
11.3. Creating Self-Contained Parts within the Compressed Data (在压缩数据中创建自包含部分)
在某些编码器实现中,可能需要使 brotli 流中的字节序列自包含 (Self-Contained),即可以独立于压缩数据的先前部分进行解压缩。这是一个有用的功能,原因有三个。首先,如果大型压缩文件损坏,可以在损坏后恢复文件的某些部分。其次,在进行压缩数据的差异传输 (Differential Transfer) 时很有用。如果未压缩字节序列不变并且独立于先前数据进行压缩,则压缩表示也可能不变,因此可以非常便宜地传输。第三,如果未压缩字节序列独立压缩,则除了多个文件的并行压缩 (Parallel Compression) 之外,还允许同一文件中这些字节序列的并行压缩.
给定两个未压缩字节序列 U0 和 U1,我们现在将描述如何创建两个压缩字节序列 C0 和 C1,使得 C0 和 C1 的连接 (Concatenation) 是有效的 brotli 流,并且 C0 和 C1(连同包含窗口大小的 C0 的第一个字节)可以彼此独立地解压缩为 U0 和 U1.
当压缩字节序列 U0 以产生 C0 时,我们可以使用任何作用于完整未压缩字节集 U0 的压缩器,但有以下两个更改。首先,C0 的最后一个元块的 ISLAST 位不得设置。其次,C0 必须在字节边界上结束,这可以通过向其附加一个空的元数据元块来确保,如第 11.2 节所述.
当压缩字节序列 U1 以产生 C1 时,我们可以使用任何在 U0+U1 输入流中 U1 开始处启动新元块的压缩器,但有以下两个更改。首先,C1 中的后向距离 (Backward Distance) 不得引用静态字典词 (Static Dictionary Word) 或 U0 中的未压缩字节。即使 U1 中的字节序列会匹配静态字典词或与 U0 重叠的字节序列,压缩器也必须用字面插入 (Literal Insertion) 和对 U1 中字节的后向引用 (Backward Reference) 的组合来表示这个字节序列。其次,在使用最后四个距离的环形缓冲区 (Ring Buffer) 编码 C1 中的其他距离之前,必须首先用 C1 中的距离补充该缓冲区。请注意,生成 C0 和 C1 的两个压缩器都必须使用相同的窗口大小 (Window Size),但流头部 (Stream Header) 仅由生成 C0 的压缩器发出.
请注意,此方法可以轻松推广到两个以上的未压缩字节序列.
关键实现考虑
简单压缩器 (Trivial Compressor):
- 不实际压缩数据
- 输出大小上界:
N + (3 * (N >> 16) + 5)字节 - 用于验证格式的正确性
字节边界对齐:
- 某些应用需要每个元块从字节边界开始
- 通过插入空元数据元块实现
自包含部分: 三个主要用途:
- 容错性 (Fault Tolerance): 文件损坏后可部分恢复
- 差异传输 (Differential Transfer): 未更改部分可高效传输
- 并行压缩 (Parallel Compression): 同一文件内的并行处理
独立压缩的约束:
C0的ISLAST位不设置C0必须在字节边界结束C1不得引用U0中的数据- 环形缓冲区必须先用
C1的距离初始化 - 窗口大小必须一致
来源 (Source): RFC 7932, Section 11
官方文本 (Official Text): https://www.rfc-editor.org/rfc/rfc7932.txt