4. Syntax of UTF-8 Byte Sequences

For the convenience of implementors using ABNF, a definition of UTF-8 in ABNF syntax is given here.

UTF-8 String Definition

A UTF-8 string is a sequence of octets representing a sequence of UCS characters. An octet sequence is valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF of [RFC2234].

ABNF Syntax

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF
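
As an illustration, the grammar can be transcribed almost rule for rule into a byte-level regular expression. The sketch below uses Python's standard `re` module; the names `TAIL`, `ALTERNATIVES`, `UTF8_OCTETS`, and `matches_grammar` are chosen for this example and do not come from the specification.

```python
import re

# UTF8-tail = %x80-BF
TAIL = rb"[\x80-\xBF]"

# One alternative per branch of UTF8-1 .. UTF8-4, in grammar order.
ALTERNATIVES = [
    rb"[\x00-\x7F]",                       # UTF8-1
    rb"[\xC2-\xDF]" + TAIL,                # UTF8-2
    rb"\xE0[\xA0-\xBF]" + TAIL,            # UTF8-3, first branch
    rb"[\xE1-\xEC]" + TAIL + TAIL,         # UTF8-3, second branch
    rb"\xED[\x80-\x9F]" + TAIL,            # UTF8-3, third branch
    rb"[\xEE-\xEF]" + TAIL + TAIL,         # UTF8-3, fourth branch
    rb"\xF0[\x90-\xBF]" + TAIL + TAIL,     # UTF8-4, first branch
    rb"[\xF1-\xF3]" + TAIL + TAIL + TAIL,  # UTF8-4, second branch
    rb"\xF4[\x80-\x8F]" + TAIL + TAIL,     # UTF8-4, third branch
]

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(rb"(?:" + b"|".join(ALTERNATIVES) + rb")*")


def matches_grammar(data: bytes) -> bool:
    """Return True if the whole byte string matches UTF8-octets."""
    return UTF8_OCTETS.fullmatch(data) is not None


print(matches_grammar("A©你😀".encode("utf-8")))  # True
print(matches_grammar(b"\xC0\x80"))               # False
```

A regular expression is convenient for checking the grammar against test data; a hand-written loop is usually a better fit when error positions are needed.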

Syntax Explanation

UTF8-1: Single-Byte Sequences

%x00-7F
  • Range: 0x00 to 0x7F (0-127)
  • Encoding range: U+0000 to U+007F (the complete ASCII range)
  • Pattern: 0xxxxxxx

UTF8-2: Two-Byte Sequences

%xC2-DF UTF8-tail
  • First byte: 0xC2 to 0xDF (194-223)
  • Tail byte: 0x80 to 0xBF (128-191)
  • Pattern: 110xxxxx 10xxxxxx
  • Encoding range: U+0080 to U+07FF
  • Note: The first byte cannot be 0xC0 or 0xC1, because those bytes could only start overlong encodings of U+0000 to U+007F
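
The restriction on the first byte can be checked with a little bit arithmetic. The helper below is a hypothetical name introduced for this sketch; it simply concatenates the payload bits of the pattern 110xxxxx 10xxxxxx.

```python
def decode_two_byte(first: int, tail: int) -> int:
    """Combine the 5 payload bits of the first byte with the 6 of the tail."""
    return ((first & 0x1F) << 6) | (tail & 0x3F)


# The largest value reachable with a 0xC0 or 0xC1 first byte still fits in
# one byte, so any such sequence would be an overlong encoding.
print(hex(decode_two_byte(0xC1, 0xBF)))  # 0x7f

# The permitted first bytes cover exactly U+0080 to U+07FF.
print(hex(decode_two_byte(0xC2, 0x80)))  # 0x80
print(hex(decode_two_byte(0xDF, 0xBF)))  # 0x7ff
```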

UTF8-3: Three-Byte Sequences

Four cases:

Case 1: %xE0 %xA0-BF UTF8-tail

  • First byte: 0xE0
  • Second byte: 0xA0 to 0xBF
  • Third byte: 0x80 to 0xBF
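  • Encoding range: U+0800 to U+0FFF
  • Note: The second byte starts at 0xA0 because 0xE0 followed by 0x80 to 0x9F would be an overlong encoding of U+0000 to U+07FF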

Case 2: %xE1-EC 2( UTF8-tail )

  • First byte: 0xE1 to 0xEC
  • Following bytes: Two UTF8-tail (0x80-0xBF)
  • Encoding range: U+1000 to U+CFFF

Case 3: %xED %x80-9F UTF8-tail

  • First byte: 0xED
  • Second byte: 0x80 to 0x9F
  • Third byte: 0x80 to 0xBF
  • Encoding range: U+D000 to U+D7FF
  • Note: The second byte stops at 0x9F to exclude the surrogate range (U+D800 to U+DFFF), which is reserved for UTF-16 and must not be encoded in UTF-8

Case 4: %xEE-EF 2( UTF8-tail )

  • First byte: 0xEE to 0xEF
  • Following bytes: Two UTF8-tail
  • Encoding range: U+E000 to U+FFFF
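
The four encoding ranges above can be recomputed from the byte bounds. The sketch below decodes the smallest and largest byte triple of each case; `decode_three_byte` and `CASES` are names invented for this example.

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Concatenate the payload bits of 1110xxxx 10xxxxxx 10xxxxxx."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# (label, smallest byte triple, largest byte triple) for each grammar branch.
CASES = [
    ("Case 1", (0xE0, 0xA0, 0x80), (0xE0, 0xBF, 0xBF)),
    ("Case 2", (0xE1, 0x80, 0x80), (0xEC, 0xBF, 0xBF)),
    ("Case 3", (0xED, 0x80, 0x80), (0xED, 0x9F, 0xBF)),
    ("Case 4", (0xEE, 0x80, 0x80), (0xEF, 0xBF, 0xBF)),
]

for label, low, high in CASES:
    first = decode_three_byte(*low)
    last = decode_three_byte(*high)
    print(f"{label}: U+{first:04X} to U+{last:04X}")

# Byte triples just outside the grammar land on excluded values:
print(hex(decode_three_byte(0xE0, 0x9F, 0xBF)))  # 0x7ff  (overlong)
print(hex(decode_three_byte(0xED, 0xA0, 0x80)))  # 0xd800 (first surrogate)
```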

UTF8-4: Four-Byte Sequences

Three cases:

Case 1: %xF0 %x90-BF 2( UTF8-tail )

  • First byte: 0xF0
  • Second byte: 0x90 to 0xBF
  • Following bytes: Two UTF8-tail
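  • Encoding range: U+10000 to U+3FFFF
  • Note: The second byte starts at 0x90 because 0xF0 followed by 0x80 to 0x8F would be an overlong encoding of U+0000 to U+FFFF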

Case 2: %xF1-F3 3( UTF8-tail )

  • First byte: 0xF1 to 0xF3
  • Following bytes: Three UTF8-tail
  • Encoding range: U+40000 to U+FFFFF

Case 3: %xF4 %x80-8F 2( UTF8-tail )

  • First byte: 0xF4
  • Second byte: 0x80 to 0x8F
  • Following bytes: Two UTF8-tail
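  • Encoding range: U+100000 to U+10FFFF
  • Note: The second byte stops at 0x8F because 0xF4 followed by 0x90 to 0xBF would encode values above U+10FFFF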
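
Going in the other direction, Python's built-in encoder can be used to confirm where the four-byte boundaries fall. This is only a cross-check under the assumption that the built-in `utf-8` codec agrees with the grammar; `encode_hex` and `BOUNDARIES` are names made up for the example.

```python
def encode_hex(code_point: int) -> str:
    """Encode one code point as UTF-8 and format the bytes as hex."""
    encoded = chr(code_point).encode("utf-8")
    return " ".join(f"{byte:02X}" for byte in encoded)


# First and last code point of each four-byte case.
BOUNDARIES = [0x10000, 0x3FFFF, 0x40000, 0xFFFFF, 0x100000, 0x10FFFF]

for cp in BOUNDARIES:
    print(f"U+{cp:06X} -> {encode_hex(cp)}")

# One past the end of the Unicode range is not a code point at all,
# so chr() refuses it before any encoding happens.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```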

Invalid Byte Values

The following byte values never appear in valid UTF-8 sequences:

  • 0xC0, 0xC1 (could only start overlong 2-byte sequences)
  • 0xF5 to 0xFF (would start sequences beyond U+10FFFF, or are not UTF-8 lead bytes at all)
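
This list can be confirmed empirically by encoding every Unicode scalar value and recording which byte values occur. The sketch relies on Python's built-in `utf-8` codec; `unused_byte_values` is a name introduced here. It loops over roughly 1.1 million code points, so it takes a moment to run.

```python
def unused_byte_values() -> list:
    """Return the byte values that occur in no UTF-8 encoded code point."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


print([f"0x{b:02X}" for b in unused_byte_values()])
```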

Complete Byte Range Summary

| Byte Value Range | Meaning | Validity |
|---|---|---|
| 0x00-0x7F | Single-byte character (ASCII) | ✅ Valid |
| 0x80-0xBF | Continuation byte | ✅ Valid as tail only |
| 0xC0-0xC1 | Prohibited | ❌ Invalid |
| 0xC2-0xDF | 2-byte sequence first byte | ✅ Valid |
| 0xE0-0xEF | 3-byte sequence first byte | ✅ Valid |
| 0xF0-0xF4 | 4-byte sequence first byte | ✅ Valid |
| 0xF5-0xFF | Prohibited | ❌ Invalid |
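
A decoder typically uses this summary to decide how many bytes to read after seeing a first byte. The function below is one possible sketch; the name `expected_length` is hypothetical. Note that the length alone does not validate a sequence: the second-byte restrictions for 0xE0, 0xED, 0xF0, and 0xF4 still have to be checked separately.

```python
from typing import Optional


def expected_length(first_byte: int) -> Optional[int]:
    """Return the sequence length implied by a first byte.

    Returns None for bytes that cannot start a sequence: continuation
    bytes (0x80-0xBF) and the prohibited values (0xC0-0xC1, 0xF5-0xFF).
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte <= 0x7F:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    return None


for b in (0x41, 0xC2, 0xE4, 0xF0, 0x80, 0xC0, 0xF5):
    print(f"0x{b:02X} -> {expected_length(b)}")
```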

Validation Examples

Valid Sequences

Example 1: 0x41
Check: 0x41 in [0x00-0x7F] → UTF8-1 → ✅ Valid
Character: 'A'

Example 2: 0xC2 0xA9
Check: 0xC2 in [0xC2-0xDF], 0xA9 in [0x80-0xBF] → UTF8-2 → ✅ Valid
Character: '©'

Example 3: 0xE4 0xBD 0xA0
Check: 0xE4 in [0xE1-0xEC], next two bytes in [0x80-0xBF] → UTF8-3 → ✅ Valid
Character: '你'

Example 4: 0xF0 0x9F 0x98 0x80
Check: 0xF0 followed by 0x9F in [0x90-0xBF], next two bytes in [0x80-0xBF] → UTF8-4 → ✅ Valid
Character: '😀'

Invalid Sequences

Example 1: 0xC0 0x80
Problem: 0xC0 is a prohibited byte value → ❌ Invalid (overlong encoding of U+0000)

Example 2: 0xED 0xA0 0x80
Problem: 0xED must be followed by a byte in [0x80-0x9F], and 0xA0 is not → ❌ Invalid (would encode the surrogate U+D800)

Example 3: 0xF5 0x80 0x80 0x80
Problem: 0xF5 is a prohibited byte value → ❌ Invalid (beyond Unicode range)

Example 4: 0xE4 0xBD
Problem: 0xE4 starts a 3-byte sequence, but only one tail byte follows → ❌ Invalid (truncated)
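
These examples can be reproduced with Python's built-in strict UTF-8 decoder, which is expected to accept and reject the same sequences. The wrapper name `decodes_strictly` is an assumption of this sketch, not part of any library.

```python
def decodes_strictly(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


VALID = [b"\x41", b"\xC2\xA9", b"\xE4\xBD\xA0", b"\xF0\x9F\x98\x80"]
INVALID = [b"\xC0\x80", b"\xED\xA0\x80", b"\xF5\x80\x80\x80", b"\xE4\xBD"]

for seq in VALID:
    print(seq.hex(), decodes_strictly(seq), repr(seq.decode("utf-8")))

for seq in INVALID:
    print(seq.hex(), decodes_strictly(seq))
```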

⚠️ Important Note

NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This grammar is believed to describe the same thing Unicode describes, but does not claim to be authoritative. Implementors are urged to rely on the authoritative source, rather than on this ABNF.

Implementation Suggestion

Validation Algorithm Pseudocode

def is_valid_utf8(data):
    """Return True if data (a bytes object) matches the UTF8-octets grammar."""
    i = 0
    while i < len(data):
        b = data[i]

        if b <= 0x7F:  # UTF8-1
            i += 1
        elif 0xC2 <= b <= 0xDF:  # UTF8-2
            if i + 1 >= len(data) or not (0x80 <= data[i+1] <= 0xBF):
                return False
            i += 2
        elif b == 0xE0:  # UTF8-3 case 1 (second byte excludes overlongs)
            if i + 2 >= len(data):
                return False
            if not (0xA0 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif 0xE1 <= b <= 0xEC:  # UTF8-3 case 2
            if i + 2 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif b == 0xED:  # UTF8-3 case 3 (second byte excludes surrogates)
            if i + 2 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0x9F and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif 0xEE <= b <= 0xEF:  # UTF8-3 case 4
            if i + 2 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0xBF and 0x80 <= data[i+2] <= 0xBF):
                return False
            i += 3
        elif b == 0xF0:  # UTF8-4 case 1 (second byte excludes overlongs)
            if i + 3 >= len(data):
                return False
            if not (0x90 <= data[i+1] <= 0xBF and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        elif 0xF1 <= b <= 0xF3:  # UTF8-4 case 2
            if i + 3 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0xBF and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        elif b == 0xF4:  # UTF8-4 case 3 (second byte caps at U+10FFFF)
            if i + 3 >= len(data):
                return False
            if not (0x80 <= data[i+1] <= 0x8F and
                    0x80 <= data[i+2] <= 0xBF and
                    0x80 <= data[i+3] <= 0xBF):
                return False
            i += 4
        else:
            # 0x80-0xC1 and 0xF5-0xFF cannot start a sequence
            return False

    return True
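
The same checks can also be driven by a small table instead of one branch per grammar case, which keeps the byte ranges in one place. This is an alternative sketch, not the form given above; `RULES` and `is_valid_utf8_table` are names invented for it.

```python
# (lead low, lead high, second-byte low, second-byte high, total length),
# one row per multi-byte branch of the grammar.
RULES = [
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),
]


def is_valid_utf8_table(data: bytes) -> bool:
    """Validate data against the grammar using the RULES table."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:  # UTF8-1
            i += 1
            continue
        for lead_lo, lead_hi, second_lo, second_hi, length in RULES:
            if lead_lo <= b <= lead_hi:
                break
        else:
            return False  # 0x80-0xC1 or 0xF5-0xFF cannot start a sequence
        if i + length > n:
            return False  # truncated sequence
        if not second_lo <= data[i + 1] <= second_hi:
            return False  # restricted second byte out of range
        if any(not 0x80 <= t <= 0xBF for t in data[i + 2:i + length]):
            return False  # remaining bytes must be UTF8-tail
        i += length
    return True


print(is_valid_utf8_table("A©你😀".encode("utf-8")))  # True
print(is_valid_utf8_table(b"\xED\xA0\x80"))           # False
```

The table form trades a little speed for compactness; the explicit branches are easier to step through when debugging a rejected input.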