4. Syntax of UTF-8 Byte Sequences

For the convenience of implementors using ABNF, a definition of UTF-8 in ABNF syntax is given here.

UTF-8 String Definition

A UTF-8 string is a sequence of octets representing a sequence of UCS characters. An octet sequence is valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF of [RFC2234].

ABNF Syntax

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF
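
The grammar above can be transcribed almost mechanically into a byte-oriented regular expression. The following is a minimal sketch, not an authoritative implementation; the names `UTF8_CHAR`, `UTF8_OCTETS`, and `matches_utf8_grammar` are illustrative and do not appear in the ABNF.

```python
import re

# Each alternative mirrors one branch of the ABNF, in the same order.
UTF8_CHAR = (
    rb"[\x00-\x7F]"                      # UTF8-1
    rb"|[\xC2-\xDF][\x80-\xBF]"          # UTF8-2
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"      # UTF8-3, first alternative
    rb"|[\xE1-\xEC][\x80-\xBF]{2}"       # UTF8-3, second alternative
    rb"|\xED[\x80-\x9F][\x80-\xBF]"      # UTF8-3, third alternative
    rb"|[\xEE-\xEF][\x80-\xBF]{2}"       # UTF8-3, fourth alternative
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"   # UTF8-4, first alternative
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"       # UTF8-4, second alternative
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"   # UTF8-4, third alternative
)

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(rb"(?:" + UTF8_CHAR + rb")*")


def matches_utf8_grammar(data: bytes) -> bool:
    """Return True if the whole byte string matches UTF8-octets."""
    return UTF8_OCTETS.fullmatch(data) is not None


print(matches_utf8_grammar(b"A\xc2\xa9"))   # True
print(matches_utf8_grammar(b"\xc0\x80"))    # False
```

Because `fullmatch` must consume the entire input, a truncated trailing sequence is rejected as well.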

Syntax Explanation

UTF8-1: Single-Byte Sequences

%x00-7F

Range: 0x00 to 0x7F (0-127)
Encoding range: U+0000 to U+007F (the complete ASCII range)
Pattern: 0xxxxxxx

UTF8-2: Two-Byte Sequences

%xC2-DF UTF8-tail

First byte: 0xC2 to 0xDF (194-223)
Tail byte: 0x80 to 0xBF (128-191)
Pattern: 110xxxxx 10xxxxxx
Encoding range: U+0080 to U+07FF
Note: The first byte cannot be 0xC0 or 0xC1; either would produce an overlong encoding of a code point in U+0000 to U+007F.
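
To see why 0xC0 and 0xC1 are excluded, it helps to extract the payload bits from the 110xxxxx 10xxxxxx pattern. The sketch below uses a hypothetical helper, `decode_two_byte`, that performs only the bit arithmetic and deliberately does no validation.

```python
def decode_two_byte(first: int, tail: int) -> int:
    """Combine the payload bits of 110xxxxx 10xxxxxx into a code point."""
    return ((first & 0x1F) << 6) | (tail & 0x3F)


# 0xC2 0xA9 carries the bits 00010 101001, which is U+00A9.
print(hex(decode_two_byte(0xC2, 0xA9)))   # 0xa9

# The largest value reachable with a 0xC1 first byte still fits in one byte,
# so every sequence starting with 0xC0 or 0xC1 would be overlong.
print(hex(decode_two_byte(0xC1, 0xBF)))   # 0x7f

# The permitted first bytes cover exactly U+0080 to U+07FF.
print(hex(decode_two_byte(0xC2, 0x80)))   # 0x80
print(hex(decode_two_byte(0xDF, 0xBF)))   # 0x7ff
```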

UTF8-3: Three-Byte Sequences

Four cases:

Case 1: `%xE0 %xA0-BF UTF8-tail`

First byte: 0xE0
Second byte: 0xA0 to 0xBF
Third byte: 0x80 to 0xBF
Encoding range: U+0800 to U+0FFF

Case 2: `%xE1-EC 2( UTF8-tail )`

First byte: 0xE1 to 0xEC
Following bytes: Two UTF8-tail (0x80-0xBF)
Encoding range: U+1000 to U+CFFF

Case 3: `%xED %x80-9F UTF8-tail`

First byte: 0xED
Second byte: 0x80 to 0x9F
Third byte: 0x80 to 0xBF
Encoding range: U+D000 to U+D7FF
Note: The second byte stops at 0x9F to exclude the surrogate range (U+D800 to U+DFFF)

Case 4: `%xEE-EF 2( UTF8-tail )`

First byte: 0xEE to 0xEF
Following bytes: Two UTF8-tail
Encoding range: U+E000 to U+FFFF
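
The byte limits of the four cases can be checked against the first and last code point of each range. The sketch below relies on Python's built-in UTF-8 encoder; `CASES` and `boundary_bytes` are illustrative names, and the `surrogatepass` error handler is used only to show which bytes a surrogate would require if it were allowed.

```python
# First and last code point of each three-byte case.
CASES = [
    (0x0800, 0x0FFF),
    (0x1000, 0xCFFF),
    (0xD000, 0xD7FF),
    (0xE000, 0xFFFF),
]


def boundary_bytes(first_cp: int, last_cp: int) -> tuple:
    """Return the encoded bytes of both range ends as hex strings."""
    return (chr(first_cp).encode("utf-8").hex(" "),
            chr(last_cp).encode("utf-8").hex(" "))


for first_cp, last_cp in CASES:
    low, high = boundary_bytes(first_cp, last_cp)
    print(f"U+{first_cp:04X}..U+{last_cp:04X}: {low} .. {high}")

# U+D800 would need the second byte 0xA0, which the grammar forbids after 0xED.
print("\ud800".encode("utf-8", "surrogatepass").hex(" "))   # ed a0 80
```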

UTF8-4: Four-Byte Sequences

Three cases:

Case 1: `%xF0 %x90-BF 2( UTF8-tail )`

First byte: 0xF0
Second byte: 0x90 to 0xBF
Following bytes: Two UTF8-tail
Encoding range: U+10000 to U+3FFFF

Case 2: `%xF1-F3 3( UTF8-tail )`

First byte: 0xF1 to 0xF3
Following bytes: Three UTF8-tail
Encoding range: U+40000 to U+FFFFF

Case 3: `%xF4 %x80-8F 2( UTF8-tail )`

First byte: 0xF4
Second byte: 0x80 to 0x8F
Following bytes: Two UTF8-tail
Encoding range: U+100000 to U+10FFFF
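
The second-byte limits in the three cases follow from the bit layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. As a rough illustration, the hypothetical function `encode_four_byte` below builds the four bytes by hand so the range ends can be compared with the cases above.

```python
def encode_four_byte(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the four-byte range")
    return bytes([
        0xF0 | (cp >> 18),            # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),           # 10xxxxxx: low 6 bits
    ])


for cp in (0x10000, 0x3FFFF, 0x40000, 0xFFFFF, 0x100000, 0x10FFFF):
    print(f"U+{cp:06X}: {encode_four_byte(cp).hex(' ')}")
```

The smallest four-byte code point, U+10000, yields a second byte of 0x90, and the largest, U+10FFFF, yields 0x8F after 0xF4, which matches the restricted ranges in Case 1 and Case 3.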

Invalid Byte Values

The following byte values never appear in a valid UTF-8 sequence:

- 0xC0, 0xC1     (would produce overlong 2-byte sequences)
- 0xF5 - 0xFF    (as first bytes, would encode values beyond U+10FFFF or match no sequence pattern)
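
One way to convince yourself of this list is to encode every Unicode scalar value and record which byte values occur. The sketch below does so with Python's built-in encoder; `unused_byte_values` and `UNUSED` are illustrative names, and the loop over roughly 1.1 million code points may take a second or so.

```python
def unused_byte_values() -> list:
    """Return the byte values that occur in no encoded Unicode scalar value."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return [b for b in range(256) if b not in seen]


UNUSED = unused_byte_values()
print([hex(b) for b in UNUSED])
```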

Complete Byte Range Summary

Byte Value Range	Meaning	Validity
0x00-0x7F	Single-byte character (ASCII)	✅ Valid
0x80-0xBF	Continuation byte	✅ Valid as tail only
0xC0-0xC1	Prohibited	❌ Invalid
0xC2-0xDF	2-byte sequence first byte	✅ Valid
0xE0-0xEF	3-byte sequence first byte	✅ Valid
0xF0-0xF4	4-byte sequence first byte	✅ Valid
0xF5-0xFF	Prohibited	❌ Invalid
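
The table translates directly into a lookup by byte value. The following sketch assumes a hypothetical `classify_byte` function and made-up label strings; it classifies a single byte in isolation and says nothing about whether the surrounding sequence is valid.

```python
def classify_byte(b: int) -> str:
    """Map a byte value to its role according to the summary table."""
    if not 0x00 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"        # ASCII
    if b <= 0xBF:
        return "continuation"  # valid as a tail byte only
    if b <= 0xC1:
        return "invalid"       # 0xC0, 0xC1
    if b <= 0xDF:
        return "lead2"
    if b <= 0xEF:
        return "lead3"
    if b <= 0xF4:
        return "lead4"
    return "invalid"           # 0xF5 to 0xFF


print([classify_byte(b) for b in b"\xe4\xbd\xa0"])
# ['lead3', 'continuation', 'continuation']
```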

Validation Examples

Valid Sequences

Example 1: 0x41
Check: 0x41 in [0x00-0x7F] → UTF8-1 → ✅ Valid
Character: 'A'

Example 2: 0xC2 0xA9
Check: 0xC2 in [0xC2-0xDF], 0xA9 in [0x80-0xBF] → UTF8-2 → ✅ Valid
Character: '©'

Example 3: 0xE4 0xBD 0xA0
Check: 0xE4 in [0xE1-0xEC], next two bytes in [0x80-0xBF] → UTF8-3 → ✅ Valid
Character: '你'

Example 4: 0xF0 0x9F 0x98 0x80
Check: 0xF0 followed by 0x9F in [0x90-0xBF], next two bytes in [0x80-0xBF] → UTF8-4 → ✅ Valid
Character: '😀'

Invalid Sequences

Example 1: 0xC0 0x80
Problem: 0xC0 is prohibited → ❌ Invalid (overlong encoding of U+0000)

Example 2: 0xED 0xA0 0x80
Problem: 0xED must be followed by a byte in [0x80-0x9F], and 0xA0 is not → ❌ Invalid (would encode the surrogate U+D800)

Example 3: 0xF5 0x80 0x80 0x80
Problem: 0xF5 is prohibited → ❌ Invalid (beyond Unicode range)

Example 4: 0xE4 0xBD
Problem: 0xE4 starts a 3-byte sequence, but only one tail byte follows → ❌ Invalid (truncated)
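
These examples can be cross-checked against an existing decoder. The sketch below assumes that CPython's strict UTF-8 decoder accepts and rejects the same sequences as the grammar, which holds for the eight examples here; `decodes_strictly`, `VALID`, and `INVALID` are illustrative names.

```python
def decodes_strictly(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


VALID = [b"\x41", b"\xc2\xa9", b"\xe4\xbd\xa0", b"\xf0\x9f\x98\x80"]
INVALID = [b"\xc0\x80", b"\xed\xa0\x80", b"\xf5\x80\x80\x80", b"\xe4\xbd"]

for seq in VALID + INVALID:
    print(seq.hex(" "), decodes_strictly(seq))
```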

⚠️ Important Note

NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This grammar is believed to describe the same thing Unicode describes, but does not claim to be authoritative. Implementors are urged to rely on the authoritative source, rather than on this ABNF.

Implementation Suggestion

Validation Algorithm Pseudocode

def is_valid_utf8(data):
    """Return True if data (a bytes object) matches the UTF8-octets grammar."""
    i = 0
    while i < len(data):
        b = data[i]

        if b <= 0x7F:  # UTF8-1
            i += 1
        elif 0xC2 <= b <= 0xDF:  # UTF8-2
            if i + 1 >= len(data) or not (0x80 <= data[i+1] <= 0xBF):
                return False
            i += 2
        elif b == 0xE0:  # UTF8-3 case 1
            if i + 2 >= len(data):
                return False
            if not (0xA0 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif 0xE1 <= b <= 0xEC:  # UTF8-3 case 2
            if i + 2 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif b == 0xED:  # UTF8-3 case 3 (second byte capped to skip surrogates)
            if i + 2 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0x9F and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif 0xEE <= b <= 0xEF:  # UTF8-3 case 4
            if i + 2 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif b == 0xF0:  # UTF8-4 case 1
            if i + 3 >= len(data):
                return False
            if not (0x90 <= data[i+1] <= 0xBF and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        elif 0xF1 <= b <= 0xF3:  # UTF8-4 case 2
            if i + 3 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0xBF and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        elif b == 0xF4:  # UTF8-4 case 3 (second byte capped at U+10FFFF)
            if i + 3 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0x8F and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        else:
            return False  # Tail byte in lead position, or prohibited byte

    return True
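
The branches of the validator differ only in the first-byte range, the permitted second-byte range, and the number of remaining tail bytes, so the same checks can be driven from a table. The following is one possible sketch of that idea, not a replacement for the authoritative definition; `_RULES` and `is_valid_utf8_table` are hypothetical names.

```python
# (first-byte low, first-byte high, second-byte low, second-byte high,
#  number of plain UTF8-tail bytes after the second byte)
_RULES = [
    (0xC2, 0xDF, 0x80, 0xBF, 0),  # UTF8-2
    (0xE0, 0xE0, 0xA0, 0xBF, 1),  # UTF8-3 case 1
    (0xE1, 0xEC, 0x80, 0xBF, 1),  # UTF8-3 case 2
    (0xED, 0xED, 0x80, 0x9F, 1),  # UTF8-3 case 3
    (0xEE, 0xEF, 0x80, 0xBF, 1),  # UTF8-3 case 4
    (0xF0, 0xF0, 0x90, 0xBF, 2),  # UTF8-4 case 1
    (0xF1, 0xF3, 0x80, 0xBF, 2),  # UTF8-4 case 2
    (0xF4, 0xF4, 0x80, 0x8F, 2),  # UTF8-4 case 3
]


def is_valid_utf8_table(data: bytes) -> bool:
    """Table-driven check of the UTF8-octets grammar."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:  # UTF8-1
            i += 1
            continue
        for lo, hi, s_lo, s_hi, extra in _RULES:
            if lo <= b <= hi:
                break
        else:
            return False  # no rule starts with this byte
        end = i + 2 + extra
        if end > n:
            return False  # truncated sequence
        if not s_lo <= data[i + 1] <= s_hi:
            return False
        if any(not 0x80 <= t <= 0xBF for t in data[i + 2:end]):
            return False
        i = end
    return True


print(is_valid_utf8_table("A©你😀".encode("utf-8")))   # True
print(is_valid_utf8_table(b"\xed\xa0\x80"))            # False
```

Keeping the ranges in one table makes it easier to compare them line by line with the ABNF, at the cost of a short linear scan per multi-byte character.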
