6. Byte order mark (BOM)
Definition
The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known informally as "BYTE ORDER MARK" (abbreviated "BOM"). This character can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but the BOM name hints at a second possible usage of the character: to prepend a U+FEFF character to a stream of UCS characters as a "signature".
Usage as Signature
A receiver of such a serialized stream may then use the initial character as a hint:
- That the stream consists of UCS characters
- To recognize which UCS encoding is involved
- With encodings having a multi-octet encoding unit, to recognize the serialization order of the octets
UTF-8 and BOM
Because UTF-8 has a single-octet encoding unit, this last function is useless, and the BOM will always appear as the octet sequence EF BB BF.
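As a rough illustration of the signature hints, the sketch below inspects the leading octets of a stream for a serialized U+FEFF. The function name `sniff_signature` and the returned labels are assumptions made for this example. Only UTF-8 and UTF-16 are checked (UTF-32 is deliberately left out), and a match is a hint, not proof.

```python
import codecs

# Serialized forms of U+FEFF, taken from the standard library constants.
_SIGNATURES = (
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
)


def sniff_signature(data: bytes):
    """Return (encoding_label, signature_length), or (None, 0) if none is found.

    Hypothetical helper: it only looks at the first octets of the stream.
    """
    for bom, label in _SIGNATURES:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```

In UTF-8 the octet order never varies, so the only information the signature can carry is "this is probably UTF-8"; in UTF-16 the same check also reveals the serialization order.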
Interpretation Rules
Position Matters
It is important to understand that the character U+FEFF appearing at any position other than the beginning of a stream MUST be interpreted with the semantics for the zero-width non-breaking space, and MUST NOT be interpreted as a signature.
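The position rule can be sketched on already-decoded text as follows. The helper name `split_signature` is hypothetical, and the sketch assumes a context in which an initial U+FEFF is allowed to be a signature at all.

```python
BOM = "\ufeff"


def split_signature(text: str):
    """Split a decoded stream into (signature, body).

    Only a U+FEFF in the first position is treated as a candidate signature.
    Any later occurrence stays in the body, where it keeps its
    ZERO WIDTH NO-BREAK SPACE semantics.
    """
    if text.startswith(BOM):
        return BOM, text[1:]
    return "", text
```

For example, `split_signature("\ufeffa\ufeffb")` separates only the first U+FEFF and leaves the one between `a` and `b` untouched.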
Stripping Considerations
When interpreted as a signature, the Unicode standard suggests that an initial U+FEFF character may be stripped before processing the text. Such stripping is necessary in some cases (e.g., when concatenating two strings, because otherwise the resulting string may contain an unintended "ZERO WIDTH NO-BREAK SPACE" at the connection point), but might affect an external process at a different layer (such as a digital signature or a count of the characters) that is relying on the presence of all characters in the stream.
Recommendation
It is therefore RECOMMENDED to:
- Avoid stripping an initial U+FEFF interpreted as a signature without a good reason
- Ignore it instead of stripping it when appropriate (such as for display)
- Strip it only when really necessary
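The concatenation case is one where stripping is hard to avoid. The sketch below, in which `concat_utf8` is a hypothetical helper, drops only the second stream's signature and leaves the first stream byte-for-byte intact, so that anything relying on the first stream's exact content is not disturbed.

```python
import codecs


def concat_utf8(first: bytes, second: bytes) -> bytes:
    """Concatenate two UTF-8 streams, stripping only the second one's signature."""
    if second.startswith(codecs.BOM_UTF8):
        second = second[len(codecs.BOM_UTF8):]
    return first + second


a = codecs.BOM_UTF8 + "foo".encode("utf-8")
b = codecs.BOM_UTF8 + "bar".encode("utf-8")

# Naive concatenation leaves an unintended U+FEFF at the connection point.
naive = (a + b).decode("utf-8")

# Stripping the second signature avoids it.
joined = concat_utf8(a, b).decode("utf-8")
```

Note that Python's plain `utf-8` codec keeps an initial U+FEFF on decode, while the `utf-8-sig` codec strips it; which one is appropriate depends on whether another layer counts on every character being present.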
Ambiguity Issues
U+FEFF in the first position of a stream MAY be interpreted as a zero-width non-breaking space, and is not always a signature.
Unicode 3.2 Addition
In an attempt at diminishing this uncertainty, Unicode 3.2 adds a new character, U+2060 "WORD JOINER", with exactly the same semantics and usage as U+FEFF except for the signature function, and strongly recommends its exclusive use for expressing word-joining semantics.
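For a producer that wants to follow this recommendation, one possible approach is sketched below. The helper `prefer_word_joiner` is hypothetical. It rewrites the character content of the text, so it carries the same risk for external layers (digital signatures, character counts) as stripping does.

```python
ZWNBSP = "\ufeff"       # ZERO WIDTH NO-BREAK SPACE, also usable as a signature
WORD_JOINER = "\u2060"  # WORD JOINER, no signature function


def prefer_word_joiner(text: str) -> str:
    """Replace every non-initial U+FEFF with U+2060.

    The first character is left alone, because an initial U+FEFF may be
    a signature rather than a word-joining character.
    """
    if not text:
        return text
    return text[0] + text[1:].replace(ZWNBSP, WORD_JOINER)
```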
Future Clarity
Eventually, following this recommendation will make it all but certain that any initial U+FEFF is a signature, not an intended "ZERO WIDTH NO-BREAK SPACE".
Protocol Guidelines
In the meantime, the uncertainty unfortunately remains and may affect Internet protocols. Protocol specifications MAY restrict usage of U+FEFF as a signature in order to reduce or eliminate the potential ill effects of this uncertainty.
In the interest of striking a balance between the advantages (reduction of uncertainty) and drawbacks (loss of the signature function) of such restrictions, it is useful to distinguish a few cases:
Case 1: Mandated UTF-8
A protocol SHOULD forbid use of U+FEFF as a signature for those textual protocol elements that the protocol mandates to be always UTF-8, the signature function being totally useless in those cases.
Case 2: Reliable Identification Mechanisms
A protocol SHOULD also forbid use of U+FEFF as a signature for those textual protocol elements for which the protocol provides character encoding identification mechanisms, when it is expected that implementations of the protocol will be in a position to always use the mechanisms properly.
This will be the case when the protocol elements are maintained tightly under the control of the implementation from the time of their creation to the time of their (properly labeled) transmission.
Case 3: Missing or Unreliable Identification Mechanisms
A protocol SHOULD NOT forbid use of U+FEFF as a signature for those textual protocol elements in any of the following cases:
- The protocol does not provide character encoding identification mechanisms for them
- A ban would be unenforceable
- It is expected that implementations of the protocol will not be in a position to always use the mechanisms properly
The latter two cases are likely to occur with larger protocol elements such as MIME entities, especially when implementations of the protocol will obtain such entities from:
- File systems
- Protocols that do not have encoding identification mechanisms for payloads (such as FTP)
- Other protocols that do not guarantee proper identification of character encoding (such as HTTP)
Implementation Guidance
When Forbidden
When a protocol forbids use of U+FEFF as a signature for a certain protocol element, then any initial U+FEFF in that protocol element MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE".
When Not Forbidden
When a protocol does NOT forbid use of U+FEFF as a signature for a certain protocol element, then implementations SHOULD be prepared to:
- Handle a signature in that element
- React appropriately by using the signature to identify the character encoding as necessary
- Strip or ignore the signature as appropriate
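The two situations can be contrasted in a minimal sketch. The function `decode_element` and its `signature_forbidden` flag are assumptions for illustration, not part of any protocol; the sketch strips a recognized signature, although ignoring it may be the better choice where another layer depends on the full character sequence.

```python
import codecs


def decode_element(data: bytes, declared: str = "utf-8",
                   signature_forbidden: bool = False) -> str:
    """Decode a protocol element under a hypothetical signature policy."""
    if signature_forbidden:
        # Any initial U+FEFF is ordinary text (ZERO WIDTH NO-BREAK SPACE),
        # so it is decoded and kept like any other character.
        return data.decode(declared)

    # Signatures are allowed: use one, if present, to identify the
    # encoding, then strip it from the decoded result.
    for bom, encoding in (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ):
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)

    # No signature: fall back to the declared encoding.
    return data.decode(declared)
```

With `signature_forbidden=True`, the octets EF BB BF at the start of a UTF-8 element decode to a leading U+FEFF that remains part of the text; with the default, the same octets are consumed as a signature.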