10. Security Considerations
Implementers of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
Overlong Encoding Attack
A particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters.
Example 1: NUL Character
For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but erroneously allow the illegal two-octet sequence C0 80 and interpret it as a NUL character.
Example 2: Path Traversal
Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.
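One way to avoid both problems is to decode the input with a strict decoder first and run the security checks only on the decoded characters. The following is a minimal sketch of that ordering, assuming Python's built-in strict UTF-8 codec; the helper names `decode_strict` and `is_safe_path` are hypothetical and the checks are illustrative, not a complete path sanitizer.

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on any illegal sequence."""
    # errors="strict" is the default; it is spelled out here for clarity.
    return data.decode("utf-8", errors="strict")


def is_safe_path(data: bytes) -> bool:
    """Hypothetical check: reject illegal UTF-8, NUL, and "/../"."""
    try:
        text = decode_strict(data)
    except UnicodeDecodeError:
        # Illegal input, including overlong forms such as C0 80 or C0 AE.
        return False
    # The checks run on decoded characters, not on the raw octets.
    return "\x00" not in text and "/../" not in text
```

Because illegal sequences are rejected before any check runs, the overlong forms C0 80 and 2F C0 AE 2E 2F never reach the comparison at all.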
Real-World Impact
This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.
Buffer Overflow Risk
Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes.
There is therefore a risk of buffer overflow if:
- The range of character numbers is not explicitly limited to U+10FFFF
- Buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences
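The two conditions above can be sketched as an encoder that refuses anything beyond U+10FFFF, so that no output ever exceeds 4 octets. This is an illustrative sketch, not a reference implementation; the names `MAX_CODE_POINT`, `MAX_UTF8_LEN`, and `encode_code_point` are hypothetical, and the surrogate check is an additional assumption drawn from the UTF-8 definition rather than from this section.

```python
MAX_CODE_POINT = 0x10FFFF  # explicit upper limit on character numbers
MAX_UTF8_LEN = 4           # longest sequence once that limit is enforced


def encode_code_point(cp: int) -> bytes:
    """Encode one character number as UTF-8, at most MAX_UTF8_LEN octets."""
    if cp < 0 or cp > MAX_CODE_POINT:
        raise ValueError(f"character number out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption: surrogate values are not encodable as UTF-8.
        raise ValueError(f"surrogate value not allowed: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

With the range check in place, a caller can size an output buffer as `MAX_UTF8_LEN` octets per character without having to account for 5- and 6-octet sequences.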
Canonical Equivalence Issues
Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character sequences.
Example: Accented Characters
For instance, an e with acute accent can be represented by:
- The precomposed U+00E9 E ACUTE character
- The canonically equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE)
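As a small illustration, the two representations above render identically but encode to different UTF-8 octet sequences, so a byte-wise comparison treats them as different strings. The variable names are chosen only for this sketch.

```python
precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 followed by U+0301

precomposed_bytes = precomposed.encode("utf-8")  # octets C3 A9
decomposed_bytes = decomposed.encode("utf-8")    # octets 65 CC 81

# Byte-wise (and code-point-wise) comparison sees two different strings.
same_bytes = precomposed_bytes == decomposed_bytes
```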
Security Implications
Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever:
- String matching is involved
- Indexing is performed
- Searching is done
- Sorting is required
- Regular expression matching is used
- Selection operations occur
Example Attack Scenario
An example would be string matching of an identifier appearing in a credential and in access control list entries.
Unicode Normalization
This issue is amenable to solutions based on Unicode Normalization Forms; see [UAX15].
Normalization Forms define a canonical representation, so that canonically equivalent character sequences compare as equal once both have been normalized to the same form.
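A minimal sketch of such a comparison, assuming Python's standard `unicodedata` module and the NFC form, might look as follows. The function name `canonical_equal` is hypothetical, and the choice of NFC over the other forms is an assumption made for this example only.

```python
import unicodedata


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Hypothetical scenario: an identifier from a credential is matched
# against an access control list entry stored in a different form.
credential_id = "rene\u0301"  # e followed by a combining acute accent
acl_entry = "ren\u00e9"       # precomposed e with acute accent

raw_match = credential_id == acl_entry
normalized_match = canonical_equal(credential_id, acl_entry)
```

Here the raw comparison fails while the normalized comparison succeeds, which is the behavior the credential and access control list matching described above depends on.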