11. Considerations for Compressor Implementations

Considerations for Compressor Implementations

Since the intent of this document is to define the brotli compressed data format without reference to any particular compression algorithm, the material in this section is not part of the definition of the format, and a compressor need not follow it in order to be compliant.

11.1. Trivial Compressor

In this section, we present a very simple algorithm that produces a valid brotli stream representing an arbitrary sequence of uncompressed bytes in the form of the following C++ language function.

     string BrotliCompressTrivial(const string& u) \{
        if (u.empty()) \{
           return string(1, 6);
        \}
        int i;
        string c;
        c.append(1, 12);
        for (i = 0; i + 65535 &lt; u.size(); i += 65536) \{
           c.append(1, 248);
           c.append(1, 255);
           c.append(1, 15);
           c.append(&u[i], 65536);
        \}
        if (i &lt; u.size()) \{
           int r = u.size() - i - 1;
           c.append(1, (r & 31) &lt;&lt; 3);
           c.append(1, r >> 5);
           c.append(1, 8 + (r >> 13));
           c.append(&u[i], r + 1);
        \}
        c.append(1, 3);
        return c;
     \}

Note that this simple algorithm does not actually compress data, that is, the brotli representation will always be bigger than the original, but it shows that every sequence of N uncompressed bytes can be represented with a valid brotli stream that is not longer than N + (3 * (N >> 16) + 5) bytes.

11.2. Aligning Compressed Meta-Blocks to Byte Boundaries

As described in Section 9, only those meta-blocks that immediately follow an uncompressed meta-block or a metadata meta-block are guaranteed to start on a byte boundary. In some applications, it might be required that every non-metadata meta-block starts on a byte boundary. This can be achieved by appending an empty metadata meta- block after every non-metadata meta-block that does not end on a byte boundary.

11.3. Creating Self-Contained Parts within the Compressed Data

In some encoder implementations, it might be required to make a sequence of bytes within a brotli stream self-contained, that is, such that they can be decompressed independently from previous parts of the compressed data. This is a useful feature for three reasons. First, if a large compressed file is damaged, it is possible to recover some of the file after the damage. Second, it is useful when doing differential transfer of compressed data. If a sequence of uncompressed bytes is unchanged and compressed independently from previous data, then the compressed representation may also be unchanged and can therefore be transferred very cheaply. Third, if sequences of uncompressed bytes are compressed independently, it allows for parallel compression of these byte sequences within the same file, in addition to parallel compression of multiple files.

Given two sequences of uncompressed bytes, U0 and U1, we will now describe how to create two sequences of compressed bytes, C0 and C1, such that the concatenation of C0 and C1 is a valid brotli stream, and that C0 and C1 (together with the first byte of C0 that contains the window size) can be decompressed independently from each other to U0 and U1.

When compressing the byte sequence U0 to produce C0, we can use any compressor that works on the complete set of uncompressed bytes U0, with the following two changes. First, the ISLAST bit of the last meta-block of C0 must not be set. Second, C0 must end at a byte- boundary, which can be ensured by appending an empty metadata meta- block to it, as in Section 11.2.

When compressing the byte sequence U1 to produce C1, we can use any compressor that starts a new meta-block at the beginning of U1 within the U0+U1 input stream, with the following two changes. First, backward distances in C1 must not refer to static dictionary words or uncompressed bytes in U0. Even if a sequence of bytes in U1 would match a static dictionary word, or a sequence of bytes that overlaps U0, the compressor must represent this sequence of bytes with a combination of literal insertions and backward references to bytes in U1 instead. Second, the ring buffer of last four distances must be replenished first with distances in C1 before using it to encode other distances in C1. Note that both compressors producing C0 and C1 have to use the same window size, but the stream header is emitted only by the compressor that produces C0.

Note that this method can be easily generalized to more than two sequences of uncompressed bytes.

Source: RFC 7932 Official Text: https://www.rfc-editor.org/rfc/rfc7932.txt