10. Security Considerations

Implementers of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

Overlong Encoding Attack

A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters.

Example 1: NUL Character

For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but erroneously allow the illegal two-octet sequence C0 80 and interpret it as a NUL character.
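The difference between the two behaviors can be sketched in Python. Both function names below are hypothetical, and both handle only two-octet sequences: the first combines payload bits without any range check, as an incautious parser might, while the second accepts only lead octets C2 through DF, which rules out overlong forms such as C0 80.

```python
def naive_decode_2byte(b0: int, b1: int) -> int:
    """Incautious decoder: merges payload bits without checking for overlong forms."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_decode_2byte(b0: int, b1: int) -> int:
    """Decode a two-octet sequence, rejecting octets the UTF-8 syntax forbids."""
    # Lead octets C0 and C1 could only encode values below U+0080,
    # which must use the single-octet form, so they are never valid.
    if not 0xC2 <= b0 <= 0xDF:
        raise ValueError(f"illegal lead octet {b0:02X}")
    if not 0x80 <= b1 <= 0xBF:
        raise ValueError(f"illegal continuation octet {b1:02X}")
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


# The naive decoder turns the illegal sequence C0 80 into NUL (0).
print(naive_decode_2byte(0xC0, 0x80))

# The strict decoder refuses the same sequence.
try:
    strict_decode_2byte(0xC0, 0x80)
except ValueError as exc:
    print("rejected:", exc)
```

Python's built-in decoder behaves like the strict variant for this input: decoding the bytes C0 80 as "utf-8" raises UnicodeDecodeError.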

Example 2: Path Traversal

Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F, in which C0 AE is an overlong encoding of 2E (".").
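One way to avoid this mismatch is to decode strictly first and apply the check to the decoded characters, so that the check and the interpretation always agree. The sketch below assumes a hypothetical helper name, is_safe_path, and mirrors only the "/../" prohibition from the example; it is not a complete path-traversal defense.

```python
def is_safe_path(raw: bytes) -> bool:
    """Return True only if raw is legal UTF-8 and does not contain "/../"."""
    try:
        # Strict decoding rejects overlong forms such as C0 AE outright.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # The check runs on decoded characters, not on raw octets.
    return "/../" not in text


print(is_safe_path(b"/docs/index.html"))                # True
print(is_safe_path(bytes.fromhex("2F 2E 2E 2F")))       # False
print(is_safe_path(bytes.fromhex("2F C0 AE 2E 2F")))    # False
```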

Real-World Impact

This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.

Buffer Overflow Risk

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes.

There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF, or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.
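A minimal encoder sketch shows how an explicit range limit bounds the output size. The names MAX_CODE_POINT, MAX_UTF8_LEN, encode_code_point, and worst_case_buffer_size are illustrative assumptions, not part of any specification. With the limit in place, no code point needs more than 4 bytes, so a buffer of 4 bytes per character is always sufficient.

```python
MAX_CODE_POINT = 0x10FFFF   # upper limit of the permitted range
MAX_UTF8_LEN = 4            # longest sequence once the limit is enforced


def encode_code_point(cp: int) -> bytes:
    """Encode one code point, refusing anything outside U+0000..U+10FFFF."""
    if cp < 0 or cp > MAX_CODE_POINT:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate code points are also excluded from UTF-8.
        raise ValueError(f"surrogate code point: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def worst_case_buffer_size(char_count: int) -> int:
    """Upper bound on the encoded size of char_count characters."""
    return char_count * MAX_UTF8_LEN
```

An encoder that skipped the range check and followed the older 31-bit definition could emit up to 6 bytes per character, which would overrun a buffer sized with the 4-byte bound.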

Canonical Equivalence Issues

Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character sequences.

Example: Accented Characters

For instance, an e with acute accent can be represented by the precomposed U+00E9 E ACUTE character or by the canonically equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE).
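The two representations can be compared directly in Python. The variable names are illustrative; the point is that the strings display alike but differ both as character sequences and as UTF-8 octets.

```python
precomposed = "\u00e9"         # U+00E9
decomposed = "\u0065\u0301"    # U+0065 followed by U+0301

precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(precomposed_bytes.hex(" "))   # c3 a9
print(decomposed_bytes.hex(" "))    # 65 cc 81
print(precomposed == decomposed)    # False
```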

Security Implications

Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever:

String matching is involved
Indexing is performed
Searching is done
Sorting is required
Regular expression matching is used
Selection operations occur

Example Attack Scenario

An example would be string matching of an identifier appearing in a credential and in access control list entries.

Unicode Normalization

This issue is amenable to solutions based on Unicode Normalization Forms; see [UAX15].

Normalizing both strings to the same Normalization Form gives canonically equivalent sequences an identical representation, so that they compare as equal.
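Applied to the credential and access control list scenario, a normalization step might look like the following sketch. The function names canonical and is_authorized are hypothetical, and the choice of NFC is an assumption made for illustration; an application may need a different form. The sketch relies on unicodedata.normalize from the Python standard library.

```python
import unicodedata
from typing import Iterable


def canonical(identifier: str) -> str:
    """Normalize an identifier to NFC (an assumed choice of form)."""
    return unicodedata.normalize("NFC", identifier)


def is_authorized(credential_id: str, acl_entries: Iterable[str]) -> bool:
    """Match a credential identifier against ACL entries after normalization."""
    normalized_acl = {canonical(entry) for entry in acl_entries}
    return canonical(credential_id) in normalized_acl


acl = ["ren\u00e9"]            # entry stored in precomposed form
presented = "rene\u0301"       # same name, presented in decomposed form

print(presented in acl)                 # False: raw comparison misses the match
print(is_authorized(presented, acl))    # True: normalized comparison finds it
```

Normalizing on both sides of the comparison matters: normalizing only the presented identifier would still fail against an entry stored in a different form.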
