4. Syntax of UTF-8 Byte Sequences
For the convenience of implementors using ABNF, a definition of UTF-8 in ABNF syntax is given here.
UTF-8 String Definition
A UTF-8 string is a sequence of octets representing a sequence of UCS characters. An octet sequence is valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF of [RFC2234].
ABNF Syntax
UTF8-octets = *( UTF8-char )
UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1 = %x00-7F
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
         %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
         %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF
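Because each alternative in the grammar is a fixed sequence of byte ranges, the ABNF can be transcribed almost mechanically into a regular expression over bytes. The following is a minimal sketch of that idea, not part of the RFC; the names `TAIL`, `ALTERNATIVES`, `UTF8_OCTETS`, and `matches_utf8_grammar` are illustrative assumptions.

```python
import re

# UTF8-tail = %x80-BF
TAIL = rb"[\x80-\xBF]"

# One regex fragment per ABNF alternative, in the order the grammar lists them.
# (Names here are illustrative; they do not come from the RFC.)
ALTERNATIVES = [
    rb"[\x00-\x7F]",                  # UTF8-1
    rb"[\xC2-\xDF]" + TAIL,           # UTF8-2
    rb"\xE0[\xA0-\xBF]" + TAIL,       # UTF8-3, first alternative
    rb"[\xE1-\xEC]" + TAIL * 2,       # UTF8-3, second alternative
    rb"\xED[\x80-\x9F]" + TAIL,       # UTF8-3, third alternative
    rb"[\xEE-\xEF]" + TAIL * 2,       # UTF8-3, fourth alternative
    rb"\xF0[\x90-\xBF]" + TAIL * 2,   # UTF8-4, first alternative
    rb"[\xF1-\xF3]" + TAIL * 3,       # UTF8-4, second alternative
    rb"\xF4[\x80-\x8F]" + TAIL * 2,   # UTF8-4, third alternative
]

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(rb"(?:" + b"|".join(ALTERNATIVES) + rb")*")


def matches_utf8_grammar(data: bytes) -> bool:
    """Return True if the whole input matches UTF8-octets."""
    return UTF8_OCTETS.fullmatch(data) is not None


print(matches_utf8_grammar(b"\xC2\xA9"))   # two-byte sequence for U+00A9
print(matches_utf8_grammar(b"\xC0\x80"))   # 0xC0 matches no alternative
```

Since `UTF8-octets` uses `*`, the empty octet sequence also matches.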
Syntax Explanation
UTF8-1: Single-Byte Sequences
%x00-7F
- Range: 0x00 to 0x7F (0-127)
- Encoding range: U+0000 to U+007F (the complete ASCII range)
- Pattern: 0xxxxxxx
UTF8-2: Two-Byte Sequences
%xC2-DF UTF8-tail
- First byte: 0xC2 to 0xDF (194-223)
- Tail byte: 0x80 to 0xBF (128-191)
- Pattern: 110xxxxx 10xxxxxx
- Encoding range: U+0080 to U+07FF
- Note: The first byte cannot be 0xC0 or 0xC1, because those would produce overlong encodings of U+0000 to U+007F
UTF8-3: Three-Byte Sequences
Four cases:
Case 1: %xE0 %xA0-BF UTF8-tail
- First byte: 0xE0
- Second byte: 0xA0 to 0xBF
- Third byte: 0x80 to 0xBF
- Encoding range: U+0800 to U+0FFF
Case 2: %xE1-EC 2( UTF8-tail )
- First byte: 0xE1 to 0xEC
- Following bytes: Two UTF8-tail (0x80-0xBF)
- Encoding range: U+1000 to U+CFFF
Case 3: %xED %x80-9F UTF8-tail
- First byte: 0xED
- Second byte: 0x80 to 0x9F
- Third byte: 0x80 to 0xBF
- Encoding range: U+D000 to U+D7FF
- Note: Excludes the surrogate code points (U+D800-U+DFFF), which are reserved for UTF-16 and must not be encoded in UTF-8
Case 4: %xEE-EF 2( UTF8-tail )
- First byte: 0xEE to 0xEF
- Following bytes: Two UTF8-tail
- Encoding range: U+E000 to U+FFFF
UTF8-4: Four-Byte Sequences
Three cases:
Case 1: %xF0 %x90-BF 2( UTF8-tail )
- First byte: 0xF0
- Second byte: 0x90 to 0xBF
- Following bytes: Two UTF8-tail
- Encoding range: U+10000 to U+3FFFF
Case 2: %xF1-F3 3( UTF8-tail )
- First byte: 0xF1 to 0xF3
- Following bytes: Three UTF8-tail
- Encoding range: U+40000 to U+FFFFF
Case 3: %xF4 %x80-8F 2( UTF8-tail )
- First byte: 0xF4
- Second byte: 0x80 to 0x8F
- Following bytes: Two UTF8-tail
- Encoding range: U+100000 to U+10FFFF
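The encoding ranges listed for each case can be checked by decoding the lowest and highest byte sequence that each alternative allows. The sketch below does this with plain bit operations; `decode_sequence` and `BOUNDS` are hypothetical names introduced for illustration, and the function assumes its input is already a single well-formed sequence (it performs no validation).

```python
def decode_sequence(seq: bytes) -> int:
    """Extract the code point from one well-formed UTF-8 sequence."""
    if len(seq) == 1:
        return seq[0]                      # 0xxxxxxx: the byte is the code point
    # The lead byte of an n-byte sequence has n one-bits and a zero-bit as
    # prefix, so its payload is the low (7 - n) bits.
    cp = seq[0] & (0xFF >> (len(seq) + 1))
    for tail in seq[1:]:
        cp = (cp << 6) | (tail & 0x3F)     # each tail byte adds 6 payload bits
    return cp


# Lowest and highest sequence permitted by each alternative of the grammar.
BOUNDS = {
    "UTF8-2":        (b"\xC2\x80",         b"\xDF\xBF"),
    "UTF8-3 case 1": (b"\xE0\xA0\x80",     b"\xE0\xBF\xBF"),
    "UTF8-3 case 2": (b"\xE1\x80\x80",     b"\xEC\xBF\xBF"),
    "UTF8-3 case 3": (b"\xED\x80\x80",     b"\xED\x9F\xBF"),
    "UTF8-3 case 4": (b"\xEE\x80\x80",     b"\xEF\xBF\xBF"),
    "UTF8-4 case 1": (b"\xF0\x90\x80\x80", b"\xF0\xBF\xBF\xBF"),
    "UTF8-4 case 2": (b"\xF1\x80\x80\x80", b"\xF3\xBF\xBF\xBF"),
    "UTF8-4 case 3": (b"\xF4\x80\x80\x80", b"\xF4\x8F\xBF\xBF"),
}

for name, (lowest, highest) in BOUNDS.items():
    print(f"{name}: U+{decode_sequence(lowest):04X} to U+{decode_sequence(highest):04X}")
```

The restricted second-byte ranges are what keep the boundaries exact: for example, requiring 0xA0-0xBF after 0xE0 makes the smallest three-byte code point U+0800 rather than an overlong form of a smaller value.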
Invalid Byte Values
The following byte values never appear in any valid UTF-8 sequence:
- 0xC0, 0xC1 (could only start overlong 2-byte encodings of U+0000 to U+007F)
- 0xF5 - 0xFF (0xF5-0xF7 could only start sequences beyond U+10FFFF; 0xF8-0xFF fit no lead-byte pattern of one to four bytes)
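One way to convince yourself of this list is to encode every Unicode scalar value and record which byte values ever occur. The sketch below does so with Python's built-in UTF-8 encoder; it is an illustrative check, not part of the RFC, and `bytes_used_by_utf8` is a name assumed here. It loops over roughly 1.1 million code points, so expect it to take a moment.

```python
def bytes_used_by_utf8() -> set:
    """Collect every byte value that occurs in the UTF-8 encoding of some code point."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(chr(cp).encode("utf-8"))
    return used


# Byte values that no encoded code point ever contains.
unused = sorted(set(range(256)) - bytes_used_by_utf8())
print(" ".join(f"0x{b:02X}" for b in unused))
```

The bytes left over are exactly 0xC0, 0xC1, and 0xF5 through 0xFF.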
Complete Byte Range Summary
| Byte Value Range | Meaning | Validity |
|---|---|---|
| 0x00-0x7F | Single-byte character (ASCII) | ✅ Valid |
| 0x80-0xBF | Continuation byte | ✅ Valid as tail only |
| 0xC0-0xC1 | Prohibited | ❌ Invalid |
| 0xC2-0xDF | 2-byte sequence first byte | ✅ Valid |
| 0xE0-0xEF | 3-byte sequence first byte | ✅ Valid |
| 0xF0-0xF4 | 4-byte sequence first byte | ✅ Valid |
| 0xF5-0xFF | Prohibited | ❌ Invalid |
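The table translates directly into a lookup on a single byte. Here is a minimal sketch, assuming a hypothetical helper `classify_byte` whose return labels are invented for this example. Note that it classifies a byte in isolation; whether a lead byte is followed by acceptable tail bytes still has to be checked against the grammar.

```python
def classify_byte(b: int) -> str:
    """Classify one byte value according to the byte range summary table."""
    if not 0 <= b <= 0xFF:
        raise ValueError("byte value out of range")
    if b <= 0x7F:
        return "single"        # 0x00-0x7F: complete one-byte character
    if b <= 0xBF:
        return "continuation"  # 0x80-0xBF: valid only as a tail byte
    if b <= 0xC1:
        return "invalid"       # 0xC0-0xC1: prohibited
    if b <= 0xDF:
        return "lead2"         # 0xC2-0xDF: first byte of a 2-byte sequence
    if b <= 0xEF:
        return "lead3"         # 0xE0-0xEF: first byte of a 3-byte sequence
    if b <= 0xF4:
        return "lead4"         # 0xF0-0xF4: first byte of a 4-byte sequence
    return "invalid"           # 0xF5-0xFF: prohibited


for value in (0x41, 0xA9, 0xC0, 0xC2, 0xE4, 0xF0, 0xF5):
    print(f"0x{value:02X}: {classify_byte(value)}")
```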
Validation Examples
Valid Sequences
Example 1: 0x41
Check: 0x41 in [0x00-0x7F] → UTF8-1 → ✅ Valid
Character: 'A'
Example 2: 0xC2 0xA9
Check: 0xC2 in [0xC2-0xDF], 0xA9 in [0x80-0xBF] → UTF8-2 → ✅ Valid
Character: '©'
Example 3: 0xE4 0xBD 0xA0
Check: 0xE4 in [0xE1-0xEC], next two bytes in [0x80-0xBF] → UTF8-3 → ✅ Valid
Character: '你'
Example 4: 0xF0 0x9F 0x98 0x80
Check: 0xF0 followed by 0x9F in [0x90-0xBF], next two bytes in [0x80-0xBF] → UTF8-4 → ✅ Valid
Character: '😀'
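These four examples can be cross-checked against Python's built-in UTF-8 decoder, which is used here only as an independent reference. The list name `VALID_EXAMPLES` is an assumption made for this sketch.

```python
# The four valid example sequences from this section.
VALID_EXAMPLES = [
    bytes([0x41]),
    bytes([0xC2, 0xA9]),
    bytes([0xE4, 0xBD, 0xA0]),
    bytes([0xF0, 0x9F, 0x98, 0x80]),
]

for raw in VALID_EXAMPLES:
    ch = raw.decode("utf-8")  # raises UnicodeDecodeError if the bytes are invalid
    print(f"{raw.hex(' ')} -> U+{ord(ch):04X} {ch}")
```

Each sequence decodes to exactly one character, which confirms that the byte counts (1, 2, 3, and 4) correspond to a single `UTF8-char` each.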
Invalid Sequences
Example 1: 0xC0 0x80
Problem: 0xC0 is prohibited → ❌ Invalid (overlong encoding)
Example 2: 0xED 0xA0 0x80
Problem: after 0xED the second byte must be in [0x80-0x9F], but 0xA0 is not → ❌ Invalid (would encode the surrogate U+D800)
Example 3: 0xF5 0x80 0x80 0x80
Problem: 0xF5 is prohibited → ❌ Invalid (beyond Unicode range)
Example 4: 0xE4 0xBD
Problem: 0xE4 starts a 3-byte sequence, but only one tail byte follows → ❌ Invalid (truncated)
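A strict decoder should reject all four of these inputs. The sketch below uses Python's built-in decoder with `errors="strict"` and reports where decoding failed; `first_error` and `INVALID_EXAMPLES` are hypothetical names, and the wording of the reason string is whatever the Python runtime supplies, not something defined by the RFC.

```python
def first_error(data: bytes):
    """Return (offset, reason) of the first decoding error, or None if data is valid."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


# The four invalid example sequences from this section.
INVALID_EXAMPLES = [
    bytes([0xC0, 0x80]),              # overlong encoding
    bytes([0xED, 0xA0, 0x80]),        # surrogate range
    bytes([0xF5, 0x80, 0x80, 0x80]),  # beyond the Unicode range
    bytes([0xE4, 0xBD]),              # truncated 3-byte sequence
]

for raw in INVALID_EXAMPLES:
    print(raw.hex(" "), "->", first_error(raw))
```

Reporting the offset of the failure, rather than a bare boolean, tends to be more useful when diagnosing corrupted input.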
⚠️ Important Note
NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This grammar is believed to describe the same thing Unicode describes, but does not claim to be authoritative. Implementors are urged to rely on the authoritative source, rather than on this ABNF.
Implementation Suggestion
Validation Algorithm (Python)
def is_valid_utf8(data):
    """Return True if data (a bytes object) matches the UTF8-octets grammar."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]  # indexing bytes yields an int in Python 3
        if b <= 0x7F:  # UTF8-1
            i += 1
        elif 0xC2 <= b <= 0xDF:  # UTF8-2
            if i + 1 >= n or not (0x80 <= data[i+1] <= 0xBF):
                return False
            i += 2
        elif b == 0xE0:  # UTF8-3 case 1: second byte restricted to A0-BF
            if i + 2 >= n:
                return False
            if not (0xA0 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif 0xE1 <= b <= 0xEC:  # UTF8-3 case 2
            if i + 2 >= n:
                return False
            if not (0x80 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif b == 0xED:  # UTF8-3 case 3: second byte restricted to 80-9F
            if i + 2 >= n:
                return False
            if not (0x80 <= data[i+1] <= 0x9F and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif 0xEE <= b <= 0xEF:  # UTF8-3 case 4
            if i + 2 >= n:
                return False
            if not (0x80 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif b == 0xF0:  # UTF8-4 case 1: second byte restricted to 90-BF
            if i + 3 >= n:
                return False
            if not (0x90 <= data[i+1] <= 0xBF and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        elif 0xF1 <= b <= 0xF3:  # UTF8-4 case 2
            if i + 3 >= n:
                return False
            if not (0x80 <= data[i+1] <= 0xBF and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        elif b == 0xF4:  # UTF8-4 case 3: second byte restricted to 80-8F
            if i + 3 >= n:
                return False
            if not (0x80 <= data[i+1] <= 0x8F and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        else:
            return False  # 0x80-0xC1 or 0xF5-0xFF cannot start a sequence
    return True