12. Changes from RFC 2279

This section documents the changes made in RFC 3629 compared to the previous specification in RFC 2279.

Character Range Restriction

Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range).

This is a significant change: RFC 2279 allowed sequences of up to six octets, encoding values up to 7FFFFFFF, whereas RFC 3629 limits UTF-8 to at most four octets and to the Unicode code space, which ends at U+10FFFF.
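As a rough illustration of the restricted range, the sketch below computes how many octets a code point needs under RFC 3629. The function name `utf8_length` and the constant `MAX_CODE_POINT` are hypothetical names chosen for this example. The surrogate check reflects RFC 3629's prohibition on encoding D800-DFFF.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the range permitted by RFC 3629


def utf8_length(code_point: int) -> int:
    """Return the number of octets UTF-8 (per RFC 3629) uses for code_point.

    Raises ValueError for values outside 0000-10FFFF and for the
    surrogate range D800-DFFF, which UTF-8 must not encode.
    """
    if code_point < 0 or code_point > MAX_CODE_POINT:
        raise ValueError(f"code point out of range: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"surrogate code point: {code_point:#x}")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # 10000-10FFFF; RFC 3629 has no 5- or 6-octet forms


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF):
        # Cross-check against Python's built-in encoder.
        assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
        print(f"U+{cp:04X}: {utf8_length(cp)} octet(s)")
```

A five-octet sequence that was legal under RFC 2279, such as `b"\xf8\x88\x80\x80\x80"`, is rejected by Python's strict UTF-8 decoder with a `UnicodeDecodeError`.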

Normative Source Change

Made Unicode the source of the normative definition of UTF-8, keeping ISO/IEC 10646 as the reference for characters.

This change recognizes Unicode as the primary authoritative source for UTF-8 encoding rules, while ISO/IEC 10646 remains the reference for the character repertoire.

Terminology Clarification

Straightened out terminology. UTF-8 now described in terms of an encoding form of the character number. UCS-2 and UCS-4 almost disappeared.

The updated terminology describes UTF-8 more clearly and consistently as an encoding of character numbers (code points), with far fewer references to UCS-2 and UCS-4.

Invalid Sequence Handling

Turned the note warning against decoding of invalid sequences into a normative MUST NOT.

Previously, the warning about invalid sequences was informative. In RFC 3629, it became a normative requirement that implementations MUST NOT decode invalid sequences.
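One way to observe this requirement in practice is to feed invalid sequences to a strict decoder. The sketch below uses Python's built-in UTF-8 codec, which rejects such input in strict mode. The helper `is_valid_utf8` and the sample list are hypothetical names introduced for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Octet sequences that a conforming decoder must not decode.
INVALID_SAMPLES = {
    "overlong NUL": b"\xc0\x80",
    "overlong '.'": b"\xc0\xae",
    "surrogate U+D800": b"\xed\xa0\x80",
    "beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "lone continuation octet": b"\x80",
    "truncated 3-octet sequence": b"\xe2\x82",
}

if __name__ == "__main__":
    for label, sample in INVALID_SAMPLES.items():
        print(f"{label}: valid={is_valid_utf8(sample)}")
```

Rejecting the input outright, rather than substituting a replacement character, is the conservative choice when the decoded text feeds a security check such as path validation.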

BOM Section Addition

Added a new section about the UTF-8 BOM, with advice for protocols.

Section 6 provides comprehensive guidance on:

  • When to use or forbid the BOM
  • How to handle the BOM in different protocol contexts
  • Recommendations for protocol designers
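For a receiver that chooses to tolerate a leading BOM, handling might look like the following sketch. The function name `strip_utf8_bom` is hypothetical. `codecs.BOM_UTF8` is the standard-library constant for the octets EF BB BF. Whether a BOM should be accepted, ignored, or forbidden depends on the protocol in question.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove one leading UTF-8 BOM (EF BB BF) if present.

    Only the very first three octets are examined; U+FEFF occurring
    elsewhere in the data is left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    payload = codecs.BOM_UTF8 + "hello".encode("utf-8")
    print(strip_utf8_bom(payload))       # BOM removed
    print(payload.decode("utf-8-sig"))   # codec that skips a leading BOM
    print(repr(payload.decode("utf-8")))  # plain codec keeps U+FEFF
```

Python's `utf-8-sig` codec performs the same stripping during decoding, while the plain `utf-8` codec leaves U+FEFF at the start of the decoded string.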

MIME Registration Change

Removed suggested UNICODE-1-1-UTF-8 MIME charset registration.

The version-specific charset label was dropped in favor of the generic "UTF-8" label, which applies to all versions of ISO/IEC 10646 and Unicode after Amendment 5.

ABNF Syntax Addition

Added an ABNF syntax for valid UTF-8 octet sequences.

Section 4 now includes a formal ABNF grammar that precisely defines valid UTF-8 byte sequences, making it easier for implementers to validate UTF-8 data.
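The grammar translates almost directly into a regular expression over octets. The sketch below is one possible transcription, with each alternative annotated with the ABNF rule it corresponds to. The names `TAIL`, `ALTERNATIVES`, `UTF8_OCTETS`, and `matches_abnf` are hypothetical, and a regex is only one of several ways to implement the check.

```python
import re

TAIL = rb"[\x80-\xBF]"  # UTF8-tail = %x80-BF

# One pattern per alternative of UTF8-char in the RFC 3629 grammar.
ALTERNATIVES = [
    rb"[\x00-\x7F]",                       # UTF8-1
    rb"[\xC2-\xDF]" + TAIL,                # UTF8-2
    rb"\xE0[\xA0-\xBF]" + TAIL,            # UTF8-3: E0 A0-BF tail
    rb"[\xE1-\xEC]" + TAIL + TAIL,         # UTF8-3: E1-EC 2(tail)
    rb"\xED[\x80-\x9F]" + TAIL,            # UTF8-3: ED 80-9F tail (no surrogates)
    rb"[\xEE-\xEF]" + TAIL + TAIL,         # UTF8-3: EE-EF 2(tail)
    rb"\xF0[\x90-\xBF]" + TAIL + TAIL,     # UTF8-4: F0 90-BF 2(tail)
    rb"[\xF1-\xF3]" + TAIL + TAIL + TAIL,  # UTF8-4: F1-F3 3(tail)
    rb"\xF4[\x80-\x8F]" + TAIL + TAIL,     # UTF8-4: F4 80-8F 2(tail), max 10FFFF
]

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(rb"(?:" + b"|".join(ALTERNATIVES) + rb")*")


def matches_abnf(data: bytes) -> bool:
    """Return True if every octet of data is covered by the grammar."""
    return UTF8_OCTETS.fullmatch(data) is not None


if __name__ == "__main__":
    print(matches_abnf("h\u00e9llo \u20ac".encode("utf-8")))
    print(matches_abnf(b"\xc0\x80"))  # overlong form, not in the grammar
```

Because the lead octets C0, C1, and F5-FF appear in no alternative, overlong two-octet forms and values above 10FFFF fail to match without any separate range check.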

Security Section Expansion

Expanded Security Considerations section, in particular impact of Unicode normalization.

The security section was significantly enhanced to include:

  • Detailed discussion of overlong encoding attacks
  • Buffer overflow risks
  • Canonical equivalence security implications
  • References to Unicode Normalization Forms
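The normalization concern can be made concrete with a small sketch: two strings that render identically can have different UTF-8 octets, so a byte-wise comparison treats them as different. The helper `equal_after_nfc` is a hypothetical name. NFC is used here purely as an example, and the appropriate normalization form depends on the application.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point, U+00E9
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def equal_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # Both sequences are valid UTF-8, yet their octets differ.
    print(composed.encode("utf-8"))    # b'\xc3\xa9'
    print(decomposed.encode("utf-8"))  # b'e\xcc\x81'
    print(composed == decomposed)                 # False
    print(equal_after_nfc(composed, decomposed))  # True
```

A check that compares raw octets, for example against a list of reserved identifiers, can therefore be bypassed with a canonically equivalent spelling unless both sides are normalized first.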