3. UTF-8 definition

UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646].

Encoding Range

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.

Single-Octet Sequences

The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.
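
For example, U+0041 (LATIN CAPITAL LETTER A, character number 65) fits in 7 bits (1000001) and is therefore encoded as the single octet 41.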

Multi-Octet Sequences

In a sequence of n octets, n>1:

  • The initial octet has the n higher-order bits set to 1, followed by a bit set to 0
  • The remaining bit(s) of that octet contain bits from the number of the character to be encoded
  • The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded (a worked example follows this list)
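
As a worked example, character number U+00E9 (binary 11101001) falls in the two-octet row of the table below: its bits are padded to the eleven available x positions (00011101001) and distributed as 110-00011 10-101001, giving the octet sequence C3 A9.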

Encoding Table

The table below summarizes the format of these different octet types. The letter x indicates bits available for encoding bits of the character number.

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
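
Similarly, U+20AC falls in the third row: its 16 bits (0010 0000 1010 1100) are split as 0010, 000010, and 101100, then placed into 1110xxxx 10xxxxxx 10xxxxxx, yielding the octet sequence E2 82 AC.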

Encoding Algorithm

Encoding a character to UTF-8 proceeds as follows:

Step 1: Determine Octet Count

Determine the number of octets required from the character number and the first column of the table above. It is important to note that the rows of the table are mutually exclusive, i.e., there is only one valid way to encode a given character.
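
This step amounts to a simple range check, sketched below in C (the function name is illustrative, not part of any standard API):

   #include <stdint.h>

   /* Number of octets needed to encode character number cp, per the
      first column of the table above; 0 if cp is out of range. */
   int utf8_octet_count(uint32_t cp)
   {
       if (cp <= 0x7F)     return 1;
       if (cp <= 0x7FF)    return 2;
       if (cp <= 0xFFFF)   return 3;
       if (cp <= 0x10FFFF) return 4;
       return 0;   /* not encodable in UTF-8 */
   }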

Step 2: Prepare High-Order Bits

Prepare the high-order bits of the octets as per the second column of the table.
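
In C, these fixed high-order patterns can be kept in a small table indexed by the octet count from Step 1 (a sketch; the names are illustrative):

   /* First-octet templates for 1- to 4-octet sequences:
      0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx. */
   static const unsigned char first_octet[5] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };

   /* Every continuation octet starts from the template 10xxxxxx. */
   static const unsigned char continuation = 0x80;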

Step 3: Fill in Character Bits

Fill in the bits marked x from the bits of the character number, expressed in binary. Start by putting the lowest-order bit of the character number in the lowest-order position of the last octet of the sequence, then put the next higher-order bit of the character number in the next higher-order position of that octet, etc. When the x bits of the last octet are filled in, move on to the next to last octet, then to the preceding one, etc. until all x bits are filled in.
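
Putting the three steps together, a minimal encoder sketch in C might look as follows (the function and variable names are illustrative, not part of any standard API; the sketch also applies the surrogate-range restriction described under "Reserved Character Ranges" below):

   #include <stddef.h>
   #include <stdint.h>

   /* Encode character number cp as UTF-8 into out (room for 4
      octets).  Returns the number of octets written, or 0 if cp
      cannot be encoded (surrogate code point or above U+10FFFF). */
   size_t utf8_encode(uint32_t cp, unsigned char out[4])
   {
       static const unsigned char first_octet[5] =
           { 0, 0x00, 0xC0, 0xE0, 0xF0 };
       size_t n;

       if (cp >= 0xD800 && cp <= 0xDFFF)
           return 0;                        /* reserved for UTF-16 */

       if (cp <= 0x7F)          n = 1;      /* Step 1: octet count */
       else if (cp <= 0x7FF)    n = 2;
       else if (cp <= 0xFFFF)   n = 3;
       else if (cp <= 0x10FFFF) n = 4;
       else return 0;

       /* Step 3: fill the x bits from the lowest-order bits of cp,
          working from the last octet back toward the first. */
       for (size_t i = n - 1; i > 0; i--) {
           out[i] = (unsigned char)(0x80 | (cp & 0x3F)); /* 10xxxxxx */
           cp >>= 6;
       }

       /* Step 2: the first octet combines its fixed high-order bits
          with the remaining character bits. */
       out[0] = (unsigned char)(first_octet[n] | cp);
       return n;
   }

Encoding U+233B4 with this sketch yields F0 A3 8E B4, matching the four-octet row of the table.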

Reserved Character Ranges

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

Converting from UTF-16

When encoding in UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to obtain character numbers, which are then encoded in UTF-8 as described above.
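
A sketch of that first step in C (assuming hi and lo have already been validated as a high and a low surrogate, respectively; the function name is illustrative):

   #include <stdint.h>

   /* Combine a UTF-16 surrogate pair into a character number in the
      range U+10000..U+10FFFF, ready for UTF-8 encoding as above. */
   uint32_t utf16_pair_to_char(uint16_t hi, uint16_t lo)
   {
       return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                       |  (uint32_t)(lo - 0xDC00));
   }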

Comparison with CESU-8

This contrasts with CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for use on the Internet. CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code values (16-bit quantities) instead of the character number (code point). This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8.
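
For example, U+233B4 is encoded in UTF-8 as F0 A3 8E B4, whereas CESU-8 first converts it to the UTF-16 surrogate pair D84C DFB4 and then transforms each 16-bit value separately, producing the 6-octet sequence ED A1 8C ED BE B4.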

Decoding Algorithm

Decoding a UTF-8 character proceeds as follows:

Step 1: Initialize Binary Number

Initialize a binary number with all bits set to 0. Up to 21 bits may be needed.

Step 2: Determine Character Bits

Determine which bits encode the character number from the number of octets in the sequence and the second column of the table above (the bits marked x).

Step 3: Distribute Bits

Distribute the bits from the sequence to the binary number, first the lower-order bits from the last octet of the sequence and proceeding to the left until no x bits are left. The binary number is now equal to the character number.
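
A minimal decoding sketch in C follows (names are illustrative). It accumulates the x bits from the first octet onward, which yields the same binary number as the right-to-left order described above, and it includes the validity checks required by the next subsection:

   #include <stddef.h>
   #include <stdint.h>

   /* Decode one UTF-8 sequence from in (len octets available) into
      *cp.  Returns the number of octets consumed, or 0 on an invalid
      sequence (truncated, bad continuation octet, overlong form,
      surrogate, or above U+10FFFF). */
   size_t utf8_decode(const unsigned char *in, size_t len, uint32_t *cp)
   {
       /* Smallest character number each sequence length may encode;
          anything below it is an overlong form. */
       static const uint32_t min_char[5] =
           { 0, 0x00, 0x80, 0x800, 0x10000 };
       uint32_t c;
       size_t n;

       if (len == 0) return 0;

       if (in[0] < 0x80)      { n = 1; c = in[0]; }
       else if (in[0] < 0xC0) return 0;           /* stray 10xxxxxx */
       else if (in[0] < 0xE0) { n = 2; c = in[0] & 0x1F; }
       else if (in[0] < 0xF0) { n = 3; c = in[0] & 0x0F; }
       else if (in[0] < 0xF8) { n = 4; c = in[0] & 0x07; }
       else return 0;

       if (len < n) return 0;                     /* truncated */

       for (size_t i = 1; i < n; i++) {
           if ((in[i] & 0xC0) != 0x80) return 0;  /* not 10xxxxxx */
           c = (c << 6) | (in[i] & 0x3F);
       }

       if (c < min_char[n])            return 0;  /* overlong */
       if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate */
       if (c > 0x10FFFF)               return 0;  /* out of range */

       *cp = c;
       return n;
   }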

Security Requirement

Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.

Examples of Invalid Sequences

For instance, a naive implementation may:

  • Decode the overlong UTF-8 sequence C0 80 into the character U+0000
  • Decode the 6-octet sequence ED A1 8C ED BE B4 (an encoded surrogate pair) into U+233B4

Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.
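
Fed to the decoding sketch above, both sequences are rejected (hypothetical usage, reusing utf8_decode from the earlier sketch):

   unsigned char overlong[] = { 0xC0, 0x80 };
   unsigned char cesu[]     = { 0xED, 0xA1, 0x8C, 0xED, 0xBE, 0xB4 };
   uint32_t cp;

   /* Both calls return 0: C0 80 is an overlong form of U+0000, and
      ED A1 8C begins the encoding of the surrogate D84C. */
   size_t r1 = utf8_decode(overlong, sizeof overlong, &cp);
   size_t r2 = utf8_decode(cesu, sizeof cesu, &cp);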