
RFC 3629 - UTF-8, a transformation format of ISO 10646

Publication Date: November 2003
Status: Internet Standard (STD 63)
Author: F. Yergeau (Alis Technologies)
Obsoletes: RFC 2279
Category: Standards Track


Abstract

ISO/IEC 10646-1 defines a large character set called the Universal Character Set (UCS) which encompasses most of the world's writing systems. The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8, the object of this memo. UTF-8 has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This memo obsoletes and replaces RFC 2279.


Status of this Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.


Copyright (C) The Internet Society (2003). All Rights Reserved.



Why is UTF-8 Important?

UTF-8 is the standard character encoding for the modern Internet. Nearly all modern web applications, APIs, and data formats use UTF-8.

Core Advantages

Feature               Description                            Importance
────────────────────────────────────────────────────────────────────────
ASCII Compatible      ASCII characters encoded identically   ⭐⭐⭐⭐⭐
No Byte Order Issues  No endianness problems                 ⭐⭐⭐⭐⭐
Self-Synchronizing    Can decode from any position           ⭐⭐⭐⭐
Space Efficient       1 byte for English, 3 bytes for CJK    ⭐⭐⭐⭐
Universal Support     Supports all Unicode characters        ⭐⭐⭐⭐⭐

UTF-8 Encoding Rules Quick Reference

Encoding Table

Unicode Range        Bytes  UTF-8 Byte Pattern
─────────────────────────────────────────────────────────────
U+0000  - U+007F       1    0xxxxxxx
U+0080  - U+07FF       2    110xxxxx 10xxxxxx
U+0800  - U+FFFF       3    1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF     4    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
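
The byte templates in this table translate directly into shift-and-mask operations. The sketch below is illustrative only (the function name utf8_encode is not from the RFC); in real code you would simply use the language's built-in codec, e.g. chr(cp).encode('utf-8') in Python.

def utf8_encode(code_point: int) -> bytes:
    """Hand-rolled encoder following the table above (sketch only)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0x7F:                       # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

assert utf8_encode(0x4F60) == "你".encode('utf-8')   # b'\xe4\xbd\xa0'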

Character Range Coverage

1 byte (ASCII):
- Latin letters, digits, basic punctuation
- Control characters
- Range: U+0000 - U+007F

2 bytes:
- Latin extensions
- Greek, Cyrillic, Arabic, Hebrew
- Range: U+0080 - U+07FF

3 bytes:
- CJK (Chinese, Japanese, Korean) characters
- Most other writing systems
- Range: U+0800 - U+FFFF

4 bytes:
- Emoji
- Historical scripts, rare CJK characters
- Range: U+10000 - U+10FFFF
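
A quick way to see these ranges in practice is to measure the encoded length of one sample character from each group. The snippet below assumes a Python interpreter and picks 'é' (U+00E9) as the 2-byte sample.

# Bytes needed for one sample character from each range above
for ch in ("A", "é", "你", "😀"):
    print(f"U+{ord(ch):04X} {ch!r}: {len(ch.encode('utf-8'))} byte(s)")
# U+0041 'A': 1 byte(s)
# U+00E9 'é': 2 byte(s)
# U+4F60 '你': 3 byte(s)
# U+1F600 '😀': 4 byte(s)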

Encoding Examples

ASCII Character

Character: 'A'
Unicode: U+0041
Binary: 0100 0001
UTF-8: 0x41
Bytes: 1

Encoding Process:
U+0041 < U+007F → use 1-byte template
0xxxxxxx → 01000001 → 0x41

Chinese Character

Character: '你'
Unicode: U+4F60
Binary: 0100 1111 0110 0000
UTF-8: 0xE4 0xBD 0xA0
Bytes: 3

Encoding Process:
U+4F60 in U+0800-U+FFFF range → use 3-byte template
Template:   1110xxxx 10xxxxxx 10xxxxxx
Code bits:      0100   111101   100000   (0x4F60 split as 4 + 6 + 6)
Filled in:  11100100 10111101 10100000
Hex:            0xE4     0xBD     0xA0

Emoji

Character: '😀'
Unicode: U+1F600
Binary: 0001 1111 0110 0000 0000
UTF-8: 0xF0 0x9F 0x98 0x80
Bytes: 4

Encoding Process:
U+1F600 in U+10000-U+10FFFF range → use 4-byte template
Template:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Code bits:       000   011111   011000   000000   (0x1F600 split as 3 + 6 + 6 + 6)
Filled in:  11110000 10011111 10011000 10000000
Hex:            0xF0     0x9F     0x98     0x80
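
The three worked examples can be double-checked against a built-in UTF-8 codec; the short Python check below simply asserts that the hand-computed byte sequences match.

# Verify the hand-computed encodings against Python's codec
for ch, expected in (("A", b"\x41"),
                     ("你", b"\xe4\xbd\xa0"),
                     ("😀", b"\xf0\x9f\x98\x80")):
    encoded = ch.encode('utf-8')
    assert encoded == expected
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
# U+0041 -> 41
# U+4F60 -> e4 bd a0
# U+1F600 -> f0 9f 98 80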

Unique Features of UTF-8

1. ASCII Compatibility

ASCII file = Valid UTF-8 file

Example:
Hello World (ASCII)
is also valid UTF-8

Reason:
ASCII uses 7 bits (0xxxxxxx)
UTF-8's 1-byte form is ASCII
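
A small check makes this concrete: a pure-ASCII byte sequence decodes to the same text whether it is treated as ASCII or as UTF-8 (Python shown; the sample string is arbitrary).

# An ASCII byte sequence is already valid UTF-8
data = b"Hello World"
assert data.decode('ascii') == data.decode('utf-8')
assert "Hello World".encode('utf-8') == "Hello World".encode('ascii')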

2. Self-Synchronizing

UTF-8 byte stream:
... E4 BD A0 E5 A5 BD ...
    └─ 你 ─┘ └─ 好 ─┘

Starting from any position:
- First byte (1110xxxx or 110xxxxx or 11110xxx) marks character start
- Continuation bytes (10xxxxxx) never mistaken for first byte

Example:
E4 BD A0 E5 A5 BD
   ↑     ↑
   │     └─ 0xE5 (1110xxxx) marks the start of a new character
   └─ 0xBD (10xxxxxx) is a continuation byte, never a character start
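
In practice this means a reader dropped at an arbitrary byte offset can back up past continuation bytes until it hits a lead byte and then decode cleanly. A minimal Python sketch (the starting offset is chosen to land inside '你'):

# Resynchronize from the middle of a character
data = "你好".encode('utf-8')            # e4 bd a0 e5 a5 bd
pos = 2                                   # points into the middle of '你'
while pos > 0 and (data[pos] & 0xC0) == 0x80:
    pos -= 1                              # skip continuation bytes (10xxxxxx)
print(data[pos:].decode('utf-8'))         # 你好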

3. No Byte Order Issues

UTF-16 needs a BOM to signal byte order:
FE FF ... (Big Endian)
FF FE ... (Little Endian)

UTF-8 doesn't need one:
The byte order within each sequence is fixed by the encoding itself
(lead byte first, then continuation bytes), so no BOM is required
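
This is easy to observe: the same character produces two different byte orders under UTF-16 depending on endianness, but exactly one byte sequence under UTF-8 (Python shown).

ch = "你"
print(ch.encode('utf-16-be').hex())   # 4f60
print(ch.encode('utf-16-le').hex())   # 604f
print(ch.encode('utf-8').hex())       # e4bda0  (always the same)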

Common Applications

Web Development

<!-- HTML file -->
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>UTF-8 Example</title>
</head>
<body>
  <p>你好,世界! Hello, World! 😀</p>
</body>
</html>

HTTP Protocol

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 1234

<!DOCTYPE html>...

JSON Data

{
  "name": "张三",
  "message": "Hello 世界",
  "emoji": "😀"
}

Database

CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Security Considerations Highlights

⚠️ Non-Shortest Form Attack

Prohibited overlong encodings:

Correct: 'A' → 0x41 (1 byte)
Wrong:   'A' → 0xC1 0x81 (2 bytes, overlong)
         'A' → 0xE0 0x81 0x81 (3 bytes, overlong)

Danger:
A decoder that accepts overlong forms can let data slip past security checks
Example: "/" sent as the overlong sequence 0xC0 0xAF to smuggle "../" past a
path-traversal filter

⚠️ Invalid Sequences

Must reject:
- Isolated continuation bytes (10xxxxxx)
- Beyond Unicode range (>U+10FFFF)
- Surrogate code points (U+D800-U+DFFF), reserved for UTF-16
- Truncated multi-byte sequences
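
As a sanity check, a strict UTF-8 decoder should refuse every class of invalid input listed above. The Python snippet below feeds one example of each to the standard codec; the exact error messages are CPython implementation details.

# Each of these malformed inputs must be rejected by a strict decoder
bad_inputs = [
    b"\xc1\x81",          # overlong encoding of 'A'
    b"\x80",              # isolated continuation byte
    b"\xf5\x80\x80\x80",  # lead byte for a value beyond U+10FFFF
    b"\xed\xa0\x80",      # surrogate code point U+D800
    b"\xe4\xbd",          # truncated multi-byte sequence
]
for data in bad_inputs:
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        print(f"{data.hex(' ')}: rejected ({exc.reason})")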

Programming Language Support

Python

# Encoding
s = "你好世界"
b = s.encode('utf-8') # bytes object

# Decoding
s = b.decode('utf-8') # str object

# File operations
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

JavaScript

// Encoding
const str = "你好世界";
const encoder = new TextEncoder();
const bytes = encoder.encode(str); // Uint8Array

// Decoding
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes); // string

Java

// Encoding
String str = "你好世界";
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

// Decoding
String decoded = new String(bytes, StandardCharsets.UTF_8);

// File operations
Files.readString(path, StandardCharsets.UTF_8);

Go

// Go's string is natively UTF-8
s := "你好世界"

// Convert to byte slice
b := []byte(s)

// Convert from byte slice
s = string(b)

Performance Characteristics

Space Efficiency Comparison

Text Type   UTF-8     UTF-16    UTF-32
────────────────────────────────────────
English     1 byte    2 bytes   4 bytes
Chinese     3 bytes   2 bytes   4 bytes
Emoji       4 bytes   4 bytes   4 bytes

English-dominant text: UTF-8 optimal
CJK-dominant text:     UTF-16 slightly better
Mixed text:            UTF-8 usually optimal
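
These figures are easy to reproduce by encoding sample strings yourself; the snippet below uses the little-endian UTF-16/UTF-32 variants so that the byte counts do not include a BOM. The sample strings are illustrative.

# Compare encoded sizes of the same text under UTF-8, UTF-16, and UTF-32
samples = {"English": "Hello, World!", "Chinese": "你好世界", "Mixed": "Hello 世界 😀"}
for label, text in samples.items():
    sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(label, sizes)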


Quick Diagnostic Tools

Identify UTF-8 Encoding

def is_utf8(data):
    """Detect if data is valid UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
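
A couple of quick checks using the function above:

print(is_utf8("你好".encode('utf-8')))   # True
print(is_utf8(b"\xc1\x81"))              # False (overlong/invalid sequence)
print(is_utf8(b"\xff\xfe"))              # False (UTF-16 BOM, not valid UTF-8)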

Fix Encoding Issues

# Common problem: UTF-8 bytes mistakenly decoded as Latin-1 (mojibake)
# Original: "你好"
# Wrong display: "ä½ å¥½"

# Fix: re-encode the garbled text as Latin-1, then decode the bytes as UTF-8
text = "ä½ å¥½"
fixed = text.encode('latin1').decode('utf-8')
# Result: "你好"

Important Note: UTF-8 is the default standard for the modern Internet. Always use UTF-8 encoding and avoid legacy encodings such as GBK, ISO-8859-1, Windows-1252, etc. All new projects should use UTF-8 as the sole character encoding.