RFC 3629 - UTF-8, a transformation format of ISO 10646
Publication Date: November 2003
Status: Internet Standard (STD 63)
Author: F. Yergeau (Alis Technologies)
Obsoletes: RFC 2279
Category: Standards Track
Abstract
ISO/IEC 10646-1 defines a large character set called the Universal Character Set (UCS) which encompasses most of the world's writing systems. The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8, the object of this memo. UTF-8 has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This memo obsoletes and replaces RFC 2279.
Status of this Memo
This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved.
Table of Contents
Main Sections
- 1. Introduction
- 2. Notational conventions
- 3. UTF-8 definition
- 4. Syntax of UTF-8 Byte Sequences
- 5. Versions of the standards
- 6. Byte order mark (BOM)
- 7. Examples
- 8. MIME registration
- 9. IANA Considerations
- 10. Security Considerations
- 11. Acknowledgements
- 12. Changes from RFC 2279
- 13. Normative References
- 14. Informative References
Why is UTF-8 Important?
UTF-8 is the standard character encoding for the modern Internet. Nearly all modern web applications, APIs, and data formats use UTF-8.
Core Advantages
| Feature | Description | Importance |
|---|---|---|
| ASCII Compatible | ASCII characters encoded identically | ⭐⭐⭐⭐⭐ |
| No Byte Order Issues | No endianness problems | ⭐⭐⭐⭐⭐ |
| Self-Synchronizing | Can decode from any position | ⭐⭐⭐⭐ |
| Space Efficient | 1 byte for English, 3 bytes for CJK | ⭐⭐⭐⭐ |
| Universal Support | Supports all Unicode characters | ⭐⭐⭐⭐⭐ |
UTF-8 Encoding Rules Quick Reference
Encoding Table
| Unicode Range | Bytes | UTF-8 Byte Pattern |
|---|---|---|
| U+0000 - U+007F | 1 | 0xxxxxxx |
| U+0080 - U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 - U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 - U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
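The table above maps directly to code. Below is a minimal Python sketch of the encoder, for illustration only; in practice `chr(cp).encode('utf-8')` does the same thing (`encode_utf8` is a name chosen here, not a standard API):

```python
# Hand-rolled UTF-8 encoder following the byte patterns in the table.
# Illustrative sketch; Python's built-in str.encode('utf-8') is the
# real implementation to use.
def encode_utf8(cp: int) -> bytes:
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point beyond U+10FFFF")

print(encode_utf8(0x41).hex())     # 41
print(encode_utf8(0x4F60).hex())   # e4bda0
print(encode_utf8(0x1F600).hex())  # f09f9880
```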
Character Range Coverage
1 byte (ASCII):
- Latin letters, digits, basic punctuation
- Control characters
- Range: U+0000 - U+007F
2 bytes:
- Latin extensions
- Greek, Cyrillic, Arabic, Hebrew
- Range: U+0080 - U+07FF
3 bytes:
- CJK (Chinese, Japanese, Korean) characters
- Most other writing systems
- Range: U+0800 - U+FFFF
4 bytes:
- Emoji
- Historical scripts, rare CJK characters
- Range: U+10000 - U+10FFFF
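The byte counts for each range can be confirmed with Python's built-in encoder, one sample character per class:

```python
# One sample character from each byte-length class, verified with
# Python's built-in UTF-8 encoder.
samples = [
    ("A", "U+0041, ASCII"),
    ("é", "U+00E9, Latin extension"),
    ("你", "U+4F60, CJK"),
    ("😀", "U+1F600, emoji"),
]
for ch, label in samples:
    print(f"{label}: {len(ch.encode('utf-8'))} byte(s)")
```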
Encoding Examples
ASCII Character
Character: 'A'
Unicode: U+0041
Binary: 0100 0001
UTF-8: 0x41
Bytes: 1
Encoding Process:
U+0041 ≤ U+007F → use 1-byte template
0xxxxxxx → 01000001 → 0x41
Chinese Character
Character: '你'
Unicode: U+4F60
Binary: 0100 1111 0110 0000
UTF-8: 0xE4 0xBD 0xA0
Bytes: 3
Encoding Process:
U+4F60 in U+0800-U+FFFF range → use 3-byte template
1110xxxx 10xxxxxx 10xxxxxx
↓ ↓ ↓
0100 111101 100000
↓ ↓ ↓
11100100 10111101 10100000
0xE4 0xBD 0xA0
Emoji
Character: '😀'
Unicode: U+1F600
Binary: 0001 1111 0110 0000 0000
UTF-8: 0xF0 0x9F 0x98 0x80
Bytes: 4
Encoding Process:
U+1F600 in U+10000-U+10FFFF range → use 4-byte template
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
↓ ↓ ↓ ↓
000 011111 011000 000000
↓ ↓ ↓ ↓
11110000 10011111 10011000 10000000
0xF0 0x9F 0x98 0x80
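All three worked examples can be cross-checked against Python's built-in encoder:

```python
# Verify the three worked examples above with Python's built-in encoder.
for ch in ("A", "你", "😀"):
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {' '.join(f'{x:02X}' for x in b)}")
# U+0041 -> 41
# U+4F60 -> E4 BD A0
# U+1F600 -> F0 9F 98 80
```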
Unique Features of UTF-8
1. ASCII Compatibility
ASCII file = Valid UTF-8 file
Example:
Hello World (ASCII)
is also valid UTF-8
Reason:
ASCII uses 7 bits (0xxxxxxx)
UTF-8's 1-byte form is ASCII
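This property is easy to demonstrate: decoding pure-ASCII bytes as ASCII or as UTF-8 yields the identical string.

```python
# Any pure-ASCII byte string is already valid UTF-8, byte for byte.
data = b"Hello World"
assert data.decode("ascii") == data.decode("utf-8")
print(data.decode("utf-8"))  # Hello World
```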
2. Self-Synchronizing
UTF-8 byte stream:
... E4 BD A0 E5 A5 BD ...
你 好
Starting from any position:
- A lead byte (0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx) marks a character start
- Continuation bytes (10xxxxxx) can never be mistaken for a lead byte
Example:
E4 BD A0 E5 A5 BD
   ↑     ↑
   |     └─ 0xE5 (1110xxxx): lead byte, a new character ('好') starts here
   └─ 0xBD (10xxxxxx): continuation byte, not a character start
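Resynchronization can be sketched in a few lines: skip backwards over continuation bytes until a lead byte is found (`char_start` is a name chosen here for illustration, not a standard API):

```python
# Sketch of resynchronization: find the start of the character that
# contains byte offset i by skipping backwards over continuation bytes.
def char_start(data: bytes, i: int) -> int:
    while i > 0 and (data[i] & 0xC0) == 0x80:  # 10xxxxxx → continuation
        i -= 1
    return i

data = "你好".encode("utf-8")   # E4 BD A0 E5 A5 BD
print(char_start(data, 1))  # 0 → offset 1 (0xBD) is inside '你'
print(char_start(data, 3))  # 3 → offset 3 (0xE5) starts '好'
```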
3. No Byte Order Issues
UTF-16 needs BOM:
FE FF ... (Big Endian)
FF FE ... (Little Endian)
UTF-8 doesn't need it:
Byte order within each sequence is fixed (lead byte first) and identical on every platform
No BOM needed to indicate byte order
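UTF-8 needs no BOM, but some tools still prepend U+FEFF (bytes EF BB BF) as a signature. Python's `'utf-8-sig'` codec strips it on decode:

```python
# A UTF-8 "BOM" (EF BB BF) is just an encoded U+FEFF signature.
# The plain 'utf-8' codec keeps it; 'utf-8-sig' strips it.
data = b"\xef\xbb\xbfHello"
print(repr(data.decode("utf-8")))      # '\ufeffHello' (BOM survives as a character)
print(repr(data.decode("utf-8-sig")))  # 'Hello' (BOM stripped)
```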
Common Applications
Web Development
<!-- HTML file -->
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>UTF-8 Example</title>
</head>
<body>
<p>你好,世界! Hello, World! 😀</p>
</body>
</html>
HTTP Protocol
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 1234
<!DOCTYPE html>...
JSON Data
{
  "name": "张三",
  "message": "Hello 世界",
  "emoji": "😀"
}
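When producing JSON from code, note that serializers often escape non-ASCII by default; with Python's `json` module, `ensure_ascii=False` emits the characters as raw UTF-8 text:

```python
import json

obj = {"name": "张三", "message": "Hello 世界", "emoji": "😀"}

# Default: non-ASCII characters are escaped as \uXXXX (ASCII-safe output)
print(json.dumps(obj))

# ensure_ascii=False keeps the characters as raw UTF-8 text
print(json.dumps(obj, ensure_ascii=False))
```

Both forms round-trip to the same object; the unescaped form is shorter and human-readable.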
Database
-- In MySQL, use utf8mb4: the legacy 'utf8' alias (utf8mb3) stores at most
-- 3 bytes per character and cannot hold emoji or other 4-byte sequences.
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Security Considerations Highlights
⚠️ Non-Shortest Form Attack
Prohibited overlong encodings:
Correct: 'A' → 0x41 (1 byte)
Wrong: 'A' → 0xC1 0x81 (2 bytes, overlong)
'A' → 0xE0 0x81 0x81 (3 bytes, overlong)
Danger:
Overlong encodings may bypass security checks that match only the shortest-form bytes
Classic example: encoding '/' as 0xC0 0xAF so that a "../" path-traversal filter fails to match
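A conforming strict decoder, such as Python's, rejects overlong forms outright rather than mapping them back to a code point:

```python
# Strict UTF-8 decoders must reject overlong forms; Python raises
# UnicodeDecodeError (0xC0 and 0xC1 can never appear in valid UTF-8).
for bad in (b"\xc1\x81", b"\xe0\x81\x81", b"\xc0\xaf"):
    try:
        bad.decode("utf-8")
        print(bad.hex(), "accepted (should not happen)")
    except UnicodeDecodeError as e:
        print(bad.hex(), "rejected:", e.reason)
```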
⚠️ Invalid Sequences
Must reject:
- Isolated continuation bytes (10xxxxxx)
- Beyond Unicode range (>U+10FFFF)
- Surrogate code points (U+D800-U+DFFF), reserved for UTF-16 and invalid in UTF-8
- Truncated multi-byte sequences
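Each category above can be exercised with one representative byte sequence; a strict decoder like Python's must reject every one:

```python
# One representative invalid sequence per category from the list above.
invalid = [
    (b"\x80",             "isolated continuation byte"),
    (b"\xf4\x90\x80\x80", "beyond U+10FFFF"),
    (b"\xed\xa0\x80",     "surrogate U+D800"),
    (b"\xe4\xbd",         "truncated 3-byte sequence"),
]
for data, why in invalid:
    try:
        data.decode("utf-8")
        print(data.hex(), "accepted (should not happen)")
    except UnicodeDecodeError:
        print(data.hex(), "rejected:", why)
```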
Programming Language Support
Python
# Encoding
s = "你好世界"
b = s.encode('utf-8') # bytes object
# Decoding
s = b.decode('utf-8') # str object
# File operations
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
JavaScript
// Encoding
const str = "你好世界";
const encoder = new TextEncoder();
const bytes = encoder.encode(str); // Uint8Array
// Decoding
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes); // string
Java
// Encoding
String str = "你好世界";
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
// Decoding
String decoded = new String(bytes, StandardCharsets.UTF_8);
// File operations
Files.readString(path, StandardCharsets.UTF_8);
Go
// Go's string is natively UTF-8
s := "你好世界"
// Convert to byte slice
b := []byte(s)
// Convert from byte slice
s = string(b)
Performance Characteristics
Space Efficiency Comparison
| Text Type | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| English | 1 byte | 2 bytes | 4 bytes |
| Chinese | 3 bytes | 2 bytes | 4 bytes |
| Emoji | 4 bytes | 4 bytes | 4 bytes |
English-dominant text: UTF-8 optimal
CJK-dominant text: UTF-16 slightly better
Mixed text: UTF-8 usually optimal
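These per-character figures can be measured directly; the `-le` codec variants are used here so that a BOM is not counted in the length:

```python
# Compare encoded sizes of the same text in UTF-8 / UTF-16 / UTF-32.
samples = {
    "English": "Hello, World",
    "Chinese": "你好世界",
    "Mixed":   "Hello 世界 😀",
}
for label, text in samples.items():
    print(f"{label:8} UTF-8: {len(text.encode('utf-8')):3}  "
          f"UTF-16: {len(text.encode('utf-16-le')):3}  "
          f"UTF-32: {len(text.encode('utf-32-le')):3}")
```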
Related Resources
- Official Text: RFC 3629 (TXT)
- Official Page: RFC 3629 DataTracker
- Standard: STD 63
- Obsoletes: RFC 2279
- Unicode Standard: Unicode.org
- ISO 10646: ISO/IEC 10646
Quick Diagnostic Tools
Identify UTF-8 Encoding
def is_utf8(data: bytes) -> bool:
    """Detect whether data is valid UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
Fix Encoding Issues
# Common problem: double encoding
# Original: "你好"
# Wrong display: "ä½ å¥½"
# Fix method:
text = "ä½ å¥½"
fixed = text.encode('latin1').decode('utf-8')
# Result: "你好"
Important Note: UTF-8 is the default standard for the modern Internet. Always use UTF-8 encoding and avoid legacy encodings such as GBK, ISO-8859-1, Windows-1252, etc. All new projects should use UTF-8 as the sole character encoding.