10. Security Considerations

Implementers of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

Overlong Encoding Attack

A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters.

Example 1: NUL Character

For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but erroneously allow the illegal two-octet sequence C0 80 and interpret it as a NUL character.

Example 2: Path Traversal

Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F, in which C0 AE is an overlong two-octet encoding of 2E (".").
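One way to avoid both pitfalls is to decode strictly first and only then run the security checks on the decoded characters. The following is a minimal Python sketch rather than a definitive implementation; the function name `is_unsafe_path` and the policy of treating every illegal sequence as unsafe are assumptions made for illustration. It relies on Python's built-in strict UTF-8 decoder, which rejects overlong forms such as C0 80 and C0 AE.

```python
def is_unsafe_path(raw: bytes) -> bool:
    """Return True if the octets should be rejected (hypothetical policy)."""
    try:
        # Strict decoding raises on any octet sequence that is not
        # well-formed UTF-8, including overlong encodings.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    # The checks run on decoded characters, never on the raw octets.
    return "\x00" in text or "/../" in text


print(is_unsafe_path(b"\x2f\x2e\x2e\x2f"))      # True: literal "/../"
print(is_unsafe_path(b"\x2f\xc0\xae\x2e\x2f"))  # True: overlong "." is rejected
print(is_unsafe_path(b"\xc0\x80"))              # True: overlong NUL is rejected
print(is_unsafe_path(b"/a/b"))                  # False
```

Because the check and the interpretation both see the same decoded text, an illegal sequence cannot pass the check and later be read as a prohibited character.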

Real-World Impact

This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.

Buffer Overflow Risk

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes.

There is therefore a risk of buffer overflow if either of the following holds:

  • The range of character numbers is not explicitly limited to U+10FFFF
  • Buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences
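The two conditions above can be made concrete with a small Python sketch that limits the range explicitly before computing an encoded length. The names `MAX_CODE_POINT` and `utf8_encoded_length` are hypothetical, and the function enforces only the U+10FFFF upper bound discussed here, not every other validity rule.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of the permitted range


def utf8_encoded_length(code_point: int) -> int:
    """Return the number of octets needed to encode code_point in UTF-8."""
    if code_point < 0 or code_point > MAX_CODE_POINT:
        # Refuse values that would need 5- or 6-octet sequences.
        raise ValueError(f"code point out of range: {code_point:#x}")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF):
    # Cross-check against Python's own encoder.
    assert utf8_encoded_length(cp) == len(chr(cp).encode("utf-8"))
```

With the range limited this way, a buffer of 4 octets per character is always sufficient; without the limit, the same buffer could be overrun by a 5- or 6-octet sequence.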

Canonical Equivalence Issues

Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character sequences.

Example: Accented Characters

For instance, an e with acute accent can be represented by:

  • The precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
  • The canonically equivalent sequence U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT)
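The difference between the two representations is easy to observe directly. The short Python sketch below is illustrative only, and its variable names are arbitrary; it shows that the two forms are unequal as strings and produce different UTF-8 octets, even though they render identically.

```python
precomposed = "\u00e9"   # single precomposed character
decomposed = "e\u0301"   # base letter followed by a combining accent

print(precomposed == decomposed)     # False
print(precomposed.encode("utf-8"))   # b'\xc3\xa9'
print(decomposed.encode("utf-8"))    # b'e\xcc\x81'
```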

Security Implications

Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever:

  • String matching is involved
  • Indexing is performed
  • Searching is done
  • Sorting is required
  • Regular expression matching is used
  • Selection operations occur

Example Attack Scenario

An example would be string matching of an identifier appearing in a credential and in access control list entries.

Unicode Normalization

This issue is amenable to solutions based on Unicode Normalization Forms; see [UAX15].

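Each Normalization Form defines a single representation for canonically equivalent sequences, so strings that are converted to the same form before comparison compare as equal whenever they are canonically equivalent.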
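As a sketch of how this might look for the access control scenario, the Python example below normalizes both the stored entries and the presented identifier before matching. It uses the standard library's `unicodedata.normalize`; the choice of NFC, the helper names `canonical` and `is_authorized`, and the sample identifier are assumptions for illustration, not something the text prescribes.

```python
import unicodedata


def canonical(identifier: str) -> str:
    """Return the identifier in Normalization Form C (assumed choice)."""
    return unicodedata.normalize("NFC", identifier)


def is_authorized(identifier: str, acl: set) -> bool:
    """Match an identifier against ACL entries after normalizing both sides."""
    normalized_acl = {canonical(entry) for entry in acl}
    return canonical(identifier) in normalized_acl


acl = {"ren\u00e9"}                        # entry stored in precomposed form
print("rene\u0301" in acl)                 # False: raw comparison misses it
print(is_authorized("rene\u0301", acl))    # True: normalized comparison matches
print(is_authorized("rene", acl))          # False: a different identifier
```

Whichever form is chosen, it has to be applied consistently on both sides of the comparison; normalizing only one side reintroduces the mismatch.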