
7. Examples

This section provides concrete examples of UTF-8 encoding for various characters and character sequences.

Example 1: Mixed Script Characters

The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL TO><ALPHA>." is encoded in UTF-8 as follows:

--+--------+-----+--
41 E2 89 A2 CE 91 2E
--+--------+-----+--

Breakdown

  • U+0041 (A) → 41 (1 byte)
  • U+2262 (≢) → E2 89 A2 (3 bytes)
  • U+0391 (Α) → CE 91 (2 bytes)
  • U+002E (.) → 2E (1 byte)
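
The breakdown above can be checked against any conforming UTF-8 encoder. The following is a minimal sketch using Python's built-in codec; the variable names are illustrative and not part of this document.

```python
# Reproduce Example 1 with Python's built-in UTF-8 codec.
# A, NOT IDENTICAL TO, GREEK CAPITAL LETTER ALPHA, FULL STOP
text = "\u0041\u2262\u0391\u002E"

encoded = text.encode("utf-8")
print(encoded.hex(" ").upper())

# Per-character view: code point, encoded bytes, and byte count.
for ch in text:
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {b.hex(' ').upper()} ({len(b)})")
```

The first line printed should match the byte sequence shown in the diagram for this example.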

Example 2: Korean

The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo", meaning "the Korean language") is encoded in UTF-8 as follows:

--------+--------+--------
ED 95 9C EA B5 AD EC 96 B4
--------+--------+--------

Breakdown

  • U+D55C (한) → ED 95 9C (3 bytes)
  • U+AD6D (국) → EA B5 AD (3 bytes)
  • U+C5B4 (어) → EC 96 B4 (3 bytes)
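
All three characters fall in the range U+0800..U+FFFF, so each uses the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx. The sketch below builds those bytes by hand to show where each group of bits ends up. The helper name encode_3byte is hypothetical, and the function deliberately handles only the 3-byte range.

```python
def encode_3byte(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF (hypothetical helper)."""
    if not 0x0800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates (U+D800..U+DFFF) are not valid scalar values.
        raise ValueError("not encodable as a 3-byte UTF-8 sequence")
    return bytes([
        0xE0 | (cp >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F),          # 10xxxxxx: low 6 bits
    ])

hangugeo = [0xD55C, 0xAD6D, 0xC5B4]
manual = b"".join(encode_3byte(cp) for cp in hangugeo)
print(manual.hex(" ").upper())

# Cross-check against the built-in codec.
assert manual == "\uD55C\uAD6D\uC5B4".encode("utf-8")
```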

Example 3: Japanese

The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo", meaning "the Japanese language") is encoded in UTF-8 as follows:

--------+--------+--------
E6 97 A5 E6 9C AC E8 AA 9E
--------+--------+--------

Breakdown

  • U+65E5 (日) → E6 97 A5 (3 bytes)
  • U+672C (本) → E6 9C AC (3 bytes)
  • U+8A9E (語) → E8 AA 9E (3 bytes)
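
Going the other direction, a decoder can tell how long each sequence is from its first byte alone. The sketch below splits the byte string from this example into per-character chunks that way. The helper sequence_length is hypothetical, and it does not validate continuation bytes; that check is left to the built-in decoder.

```python
def sequence_length(lead: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"invalid lead byte: {lead:#04x}")

data = bytes.fromhex("E6 97 A5 E6 9C AC E8 AA 9E")

# Walk the byte string, cutting one sequence at a time.
chunks = []
i = 0
while i < len(data):
    n = sequence_length(data[i])
    chunks.append(data[i:i + n])
    i += n

for chunk in chunks:
    ch = chunk.decode("utf-8")  # the codec validates the full sequence
    print(f"{chunk.hex(' ').upper()} -> U+{ord(ch):04X}")
```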

Example 4: Chinese Character with BOM

The character U+233B4 (a Chinese character meaning 'stump of tree'), prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:

--------+-----------
EF BB BF F0 A3 8E B4
--------+-----------

Breakdown

  • U+FEFF (BOM) → EF BB BF (3 bytes)
  • U+233B4 (𣎴) → F0 A3 8E B4 (4 bytes)

Note

This example demonstrates:

  1. The UTF-8 BOM encoding (EF BB BF)
  2. A 4-byte UTF-8 sequence for a character beyond the Basic Multilingual Plane (BMP)
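
Whether the leading BOM survives decoding depends on the decoder. As one illustration, assuming Python's standard codecs, the "utf-8" codec keeps the BOM as the character U+FEFF while "utf-8-sig" strips it:

```python
import codecs

data = bytes.fromhex("EF BB BF F0 A3 8E B4")

# The first three bytes are the UTF-8 encoding of U+FEFF.
has_bom = data.startswith(codecs.BOM_UTF8)

kept = data.decode("utf-8")          # BOM stays as U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is removed if present

print(has_bom, [f"U+{ord(c):04X}" for c in kept])
print([f"U+{ord(c):04X}" for c in stripped])

# U+233B4 lies beyond the BMP, so it needs a 4-byte sequence.
print(len(stripped.encode("utf-8")))
```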

Summary Table

Character(s)  Unicode   UTF-8 Encoding  Bytes
A             U+0041    41              1
≢             U+2262    E2 89 A2        3
Α             U+0391    CE 91           2
한            U+D55C    ED 95 9C        3
국            U+AD6D    EA B5 AD        3
어            U+C5B4    EC 96 B4        3
日            U+65E5    E6 97 A5        3
本            U+672C    E6 9C AC        3
語            U+8A9E    E8 AA 9E        3
BOM           U+FEFF    EF BB BF        3
𣎴            U+233B4   F0 A3 8E B4     4