3. UTF-8 definition
UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646].
Encoding Range
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.
Single-Octet Sequences
The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.
Multi-Octet Sequences
In a sequence of n octets, n>1:
- The initial octet has the n higher-order bits set to 1, followed by a bit set to 0
- The remaining bit(s) of that octet contain bits from the number of the character to be encoded
- The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded
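The bit patterns above can be recognized with a few mask-and-compare tests. The following is a minimal sketch; the function name `classify_octet` and its return labels are illustrative assumptions, not part of this specification. It checks bit patterns only and does not establish that an octet can appear in valid UTF-8.

```python
def classify_octet(octet: int) -> str:
    """Classify one octet by its high-order bits (pattern match only)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    if octet & 0x80 == 0x00:   # 0xxxxxxx
        return "single"
    if octet & 0xC0 == 0x80:   # 10xxxxxx
        return "continuation"
    if octet & 0xE0 == 0xC0:   # 110xxxxx
        return "lead-2"
    if octet & 0xF0 == 0xE0:   # 1110xxxx
        return "lead-3"
    if octet & 0xF8 == 0xF0:   # 11110xxx
        return "lead-4"
    return "invalid"           # 11111xxx matches no pattern
```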
Encoding Table
The table below summarizes the format of these different octet types. The letter x indicates bits available for encoding bits of the character number.
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
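The first column of the table maps directly onto a range test. As a sketch (the name `utf8_length` is an assumption made for illustration):

```python
def utf8_length(cp: int) -> int:
    """Return the number of octets the table assigns to character number cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("character number outside U+0000..U+10FFFF")
    if cp <= 0x7F:
        return 1   # 0000 0000-0000 007F
    if cp <= 0x7FF:
        return 2   # 0000 0080-0000 07FF
    if cp <= 0xFFFF:
        return 3   # 0000 0800-0000 FFFF
    return 4       # 0001 0000-0010 FFFF
```

Because each character number falls in exactly one row, the function has a single answer for every input in range.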
Encoding Algorithm
Encoding a character to UTF-8 proceeds as follows:
Step 1: Determine Octet Count
Determine the number of octets required from the character number and the first column of the table above. The rows of the table are mutually exclusive, i.e., there is only one valid way to encode a given character.
Step 2: Prepare High-Order Bits
Prepare the high-order bits of the octets as per the second column of the table.
Step 3: Fill in Character Bits
Fill in the bits marked x from the bits of the character number, expressed in binary. Start by putting the lowest-order bit of the character number in the lowest-order position of the last octet of the sequence, then put the next higher-order bit of the character number in the next higher-order position of that octet, etc. When the x bits of the last octet are filled in, move on to the next to last octet, then to the preceding one, etc. until all x bits are filled in.
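The three steps can be sketched as follows. This is an illustrative implementation, not a normative one; the name `utf8_encode` is an assumption. It fills the last octet first, as Step 3 describes, and it also refuses U+D800..U+DFFF, which the definition of UTF-8 excludes.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one character number as a UTF-8 octet sequence (sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("character number outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("U+D800..U+DFFF cannot be encoded in UTF-8")

    # Step 1: determine the octet count from the first column of the table.
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        n, lead = 2, 0xC0      # 110xxxxx
    elif cp <= 0xFFFF:
        n, lead = 3, 0xE0      # 1110xxxx
    else:
        n, lead = 4, 0xF0      # 11110xxx

    out = bytearray(n)
    # Steps 2 and 3: starting from the last octet, set the 10 marker and
    # fill the six x bits with the lowest-order bits not yet consumed.
    for i in range(n - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # The bits that remain fit in the x bits of the initial octet.
    out[0] = lead | cp
    return bytes(out)
```

For example, `utf8_encode(0x20AC)` yields the octets E2 82 AC, and `utf8_encode(0x233B4)` yields F0 A3 8E B4.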
Reserved Character Ranges
The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
Converting from UTF-16
When encoding in UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to obtain character numbers, which are then encoded in UTF-8 as described above.
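A sketch of that two-stage conversion is shown below. The function names are assumptions for illustration, and the input is assumed to be a list of 16-bit code values. The first function pairs surrogates to recover character numbers; the second hands those numbers to Python's built-in UTF-8 encoder.

```python
def utf16_units_to_code_points(units: list[int]) -> list[int]:
    """Decode 16-bit UTF-16 code values into character numbers."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(u)
            i += 1
    return out


def utf16_units_to_utf8(units: list[int]) -> bytes:
    """Decode UTF-16 first, then encode the character numbers as UTF-8."""
    code_points = utf16_units_to_code_points(units)
    return "".join(chr(cp) for cp in code_points).encode("utf-8")
```

With the code values D84C DFB4 as input, the intermediate character number is U+233B4 and the UTF-8 result is F0 A3 8E B4.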
Comparison with CESU-8
This contrasts with CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for use on the Internet. CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code values (16-bit quantities) instead of the character number (code point). This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8.
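The difference can be seen by encoding the same character both ways. The sketch below is illustrative only; `cesu8_encode` is a hypothetical helper, not a standard library function. It relies on Python's `surrogatepass` error handler to emit the three-octet form of each surrogate code value, and it assumes the input string contains no unpaired surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text CESU-8 style: one octet sequence per UTF-16 code value."""
    out = bytearray()
    data = text.encode("utf-16-be")
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        # surrogatepass lets the codec encode a lone surrogate code value
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U000233B4"
utf8_form = ch.encode("utf-8")      # F0 A3 8E B4
cesu8_form = cesu8_encode(ch)       # ED A1 8C ED BE B4
```

A strict UTF-8 decoder, such as `bytes.decode("utf-8")` in Python 3, rejects the six-octet CESU-8 form.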
Decoding Algorithm
Decoding a UTF-8 character proceeds as follows:
Step 1: Initialize Binary Number
Initialize a binary number with all bits set to 0. Up to 21 bits may be needed.
Step 2: Determine Character Bits
Determine which bits encode the character number from the number of octets in the sequence and the second column of the table above (the bits marked x).
Step 3: Distribute Bits
Distribute the bits from the sequence to the binary number, first the lower-order bits from the last octet of the sequence and proceeding to the left until no x bits are left. The binary number is now equal to the character number.
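The three steps can be sketched for a single sequence as follows. The name `decode_sequence_naive` is an assumption, and the function deliberately performs no validation: it assumes the input is exactly one well-formed sequence of 1 to 4 octets. It gathers the x bits from left to right while shifting, which produces the same number as distributing them from the last octet leftward.

```python
def decode_sequence_naive(seq: bytes) -> int:
    """Apply the decoding steps to one sequence. Performs NO validation."""
    n = len(seq)
    # Step 1: start from a number with all bits set to 0.
    number = 0
    # Step 2: the initial octet carries 7, 5, 4, or 3 x bits for n = 1..4.
    if n == 1:
        number = seq[0] & 0x7F
    else:
        number = seq[0] & (0xFF >> (n + 1))
    # Step 3: each following octet contributes its low six bits.
    for octet in seq[1:]:
        number = (number << 6) | (octet & 0x3F)
    return number
```

For example, the octets E2 82 AC decode to U+20AC. Because nothing is checked, the function also accepts input that is not valid UTF-8.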
Security Requirement
Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.
Examples of Invalid Sequences
For instance, a naive implementation may:
- Decode the overlong UTF-8 sequence C0 80 into the character U+0000
- Decode the surrogate pair ED A1 8C ED BE B4 into U+233B4
Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.
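One way to guard against such input is to check each decoded number against the range that its sequence length is allowed to represent. The following is a sketch under that assumption, not a complete validator for octet streams; the name `decode_sequence_strict` is illustrative, and the function handles exactly one sequence.

```python
def decode_sequence_strict(seq: bytes) -> int:
    """Decode one UTF-8 sequence, rejecting invalid forms (sketch)."""
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    # (expected length, x bits of the initial octet, smallest legal number)
    if lead < 0x80:
        expected, number, minimum = 1, lead, 0x0
    elif lead & 0xE0 == 0xC0:
        expected, number, minimum = 2, lead & 0x1F, 0x80
    elif lead & 0xF0 == 0xE0:
        expected, number, minimum = 3, lead & 0x0F, 0x800
    elif lead & 0xF8 == 0xF0:
        expected, number, minimum = 4, lead & 0x07, 0x10000
    else:
        raise ValueError("invalid initial octet")
    if len(seq) != expected:
        raise ValueError("sequence length does not match initial octet")
    for octet in seq[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("following octet is not of the form 10xxxxxx")
        number = (number << 6) | (octet & 0x3F)
    if number < minimum:
        raise ValueError("overlong sequence")
    if 0xD800 <= number <= 0xDFFF:
        raise ValueError("surrogate character number")
    if number > 0x10FFFF:
        raise ValueError("character number above U+10FFFF")
    return number
```

With these checks, C0 80 is rejected as overlong, and ED A1 8C is rejected as a surrogate before the second half of the pair is even examined.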