
RFC 3629 - UTF-8, a transformation format of ISO 10646

Publication Date: November 2003
Status: Internet Standard (STD 63)
Author: F. Yergeau (Alis Technologies)
Obsoletes: RFC 2279
Category: Standards Track


Abstract

ISO/IEC 10646-1 defines a large character set called the Universal Character Set (UCS) which encompasses most of the world's writing systems. The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8, the object of this memo. UTF-8 has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This memo obsoletes and replaces RFC 2279.


Status of this Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.


Copyright (C) The Internet Society (2003). All Rights Reserved.



Why is UTF-8 Important?

UTF-8 is the standard character encoding for the modern Internet. Nearly all modern web applications, APIs, and data formats use UTF-8.

Core Advantages

Feature               Description                            Importance
────────────────────────────────────────────────────────────────────────
ASCII Compatible      ASCII characters encoded identically   ⭐⭐⭐⭐⭐
No Byte Order Issues  No endianness problems                 ⭐⭐⭐⭐⭐
Self-Synchronizing    Can decode from any position           ⭐⭐⭐⭐
Space Efficient       1 byte for English, 3 bytes for CJK    ⭐⭐⭐⭐
Universal Support     Supports all Unicode characters        ⭐⭐⭐⭐⭐

UTF-8 Encoding Rules Quick Reference

Encoding Table

Unicode Range        Bytes  UTF-8 Byte Pattern
─────────────────────────────────────────────────────────────
U+0000  - U+007F       1    0xxxxxxx
U+0080  - U+07FF       2    110xxxxx 10xxxxxx
U+0800  - U+FFFF       3    1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF     4    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
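
The byte templates in this table translate directly into shift-and-mask operations. The sketch below is illustrative only (the function name utf8_encode is not from the RFC); in real code you would simply use the language's built-in codec, e.g. chr(cp).encode('utf-8') in Python.

def utf8_encode(code_point: int) -> bytes:
    """Hand-rolled encoder following the table above (sketch only)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0x7F:                       # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

assert utf8_encode(0x4F60) == "你".encode('utf-8')   # b'\xe4\xbd\xa0'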

Character Range Coverage

1 byte (ASCII):
- Latin letters, digits, basic punctuation
- Control characters
- Range: U+0000 - U+007F

2 bytes:
- Latin extensions
- Greek, Cyrillic, Arabic, Hebrew
- Range: U+0080 - U+07FF

3 bytes:
- CJK (Chinese, Japanese, Korean) characters
- Most other writing systems
- Range: U+0800 - U+FFFF

4 bytes:
- Emoji
- Historical scripts, rare CJK characters
- Range: U+10000 - U+10FFFF
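
A quick way to see these ranges in practice is to measure the encoded length of one sample character from each group. The snippet below assumes a Python interpreter and picks 'é' (U+00E9) as the 2-byte sample.

# Bytes needed for one sample character from each range above
for ch in ("A", "é", "你", "😀"):
    print(f"U+{ord(ch):04X} {ch!r}: {len(ch.encode('utf-8'))} byte(s)")
# U+0041 'A': 1 byte(s)
# U+00E9 'é': 2 byte(s)
# U+4F60 '你': 3 byte(s)
# U+1F600 '😀': 4 byte(s)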

Encoding Examples

ASCII Character

Character: 'A'
Unicode: U+0041
Binary: 0100 0001
UTF-8: 0x41
Bytes: 1

Encoding Process:
U+0041 < U+007F → use 1-byte template
0xxxxxxx → 01000001 → 0x41

Chinese Character

Character: '你'
Unicode: U+4F60
Binary: 0100 1111 0110 0000
UTF-8: 0xE4 0xBD 0xA0
Bytes: 3

Encoding Process:
U+4F60 in U+0800-U+FFFF range → use 3-byte template
Template:   1110xxxx 10xxxxxx 10xxxxxx
Code bits:      0100   111101   100000   (0x4F60 split as 4 + 6 + 6)
Filled in:  11100100 10111101 10100000
Hex:            0xE4     0xBD     0xA0

Emoji

Character: '😀'
Unicode: U+1F600
Binary: 0001 1111 0110 0000 0000
UTF-8: 0xF0 0x9F 0x98 0x80
Bytes: 4

Encoding Process:
U+1F600 in U+10000-U+10FFFF range → use 4-byte template
Template:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Code bits:       000   011111   011000   000000   (0x1F600 split as 3 + 6 + 6 + 6)
Filled in:  11110000 10011111 10011000 10000000
Hex:            0xF0     0x9F     0x98     0x80
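
The three worked examples can be double-checked against a built-in UTF-8 codec; the short Python check below simply asserts that the hand-computed byte sequences match.

# Verify the hand-computed encodings against Python's codec
for ch, expected in (("A", b"\x41"),
                     ("你", b"\xe4\xbd\xa0"),
                     ("😀", b"\xf0\x9f\x98\x80")):
    encoded = ch.encode('utf-8')
    assert encoded == expected
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
# U+0041 -> 41
# U+4F60 -> e4 bd a0
# U+1F600 -> f0 9f 98 80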

Unique Features of UTF-8

1. ASCII Compatibility

ASCII file = Valid UTF-8 file

Example:
Hello World (ASCII)
is also valid UTF-8

Reason:
ASCII uses 7 bits (0xxxxxxx)
UTF-8's 1-byte form is ASCII
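
A small check makes this concrete: a pure-ASCII byte sequence decodes to the same text whether it is treated as ASCII or as UTF-8 (Python shown; the sample string is arbitrary).

# An ASCII byte sequence is already valid UTF-8
data = b"Hello World"
assert data.decode('ascii') == data.decode('utf-8')
assert "Hello World".encode('utf-8') == "Hello World".encode('ascii')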

2. Self-Synchronizing

UTF-8 byte stream:
... E4 BD A0 E5 A5 BD ...
    └─ 你 ─┘ └─ 好 ─┘

Starting from any position:
- First byte (1110xxxx or 110xxxxx or 11110xxx) marks character start
- Continuation bytes (10xxxxxx) never mistaken for first byte

Example:
E4 BD A0 E5 A5 BD
   ↑     ↑
   │     └─ 0xE5 (1110xxxx) marks the start of a new character
   └─ 0xBD (10xxxxxx) is a continuation byte, never a character start
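
In practice this means a reader dropped at an arbitrary byte offset can back up past continuation bytes until it hits a lead byte and then decode cleanly. A minimal Python sketch (the starting offset is chosen to land inside '你'):

# Resynchronize from the middle of a character
data = "你好".encode('utf-8')            # e4 bd a0 e5 a5 bd
pos = 2                                   # points into the middle of '你'
while pos > 0 and (data[pos] & 0xC0) == 0x80:
    pos -= 1                              # skip continuation bytes (10xxxxxx)
print(data[pos:].decode('utf-8'))         # 你好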

3. No Byte Order Issues

UTF-16 needs a BOM to signal byte order:
FE FF ... (Big Endian)
FF FE ... (Little Endian)

UTF-8 doesn't need one:
The byte order within each sequence is fixed by the encoding itself
(lead byte first, then continuation bytes), so no BOM is required
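
This is easy to observe: the same character produces two different byte orders under UTF-16 depending on endianness, but exactly one byte sequence under UTF-8 (Python shown).

ch = "你"
print(ch.encode('utf-16-be').hex())   # 4f60
print(ch.encode('utf-16-le').hex())   # 604f
print(ch.encode('utf-8').hex())       # e4bda0  (always the same)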

Common Applications

Web Development

<!-- HTML file -->
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>UTF-8 Example</title>
</head>
<body>
  <p>你好,世界! Hello, World! 😀</p>
</body>
</html>

HTTP Protocol

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 1234

<!DOCTYPE html>...

JSON Data

{
  "name": "张三",
  "message": "Hello 世界",
  "emoji": "😀"
}

Database

CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Security Considerations Highlights

⚠️ Non-Shortest Form Attack

Prohibited overlong encodings:

Correct: 'A' → 0x41 (1 byte)
Wrong:   'A' → 0xC1 0x81 (2 bytes, overlong)
         'A' → 0xE0 0x81 0x81 (3 bytes, overlong)

Danger:
A decoder that accepts overlong forms can let data slip past security checks
Example: "/" sent as the overlong sequence 0xC0 0xAF to smuggle "../" past a
path-traversal filter

⚠️ Invalid Sequences

Must reject:
- Isolated continuation bytes (10xxxxxx)
- Beyond Unicode range (>U+10FFFF)
- Surrogate code points (U+D800-U+DFFF), reserved for UTF-16
- Truncated multi-byte sequences
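
As a sanity check, a strict UTF-8 decoder should refuse every class of invalid input listed above. The Python snippet below feeds one example of each to the standard codec; the exact error messages are CPython implementation details.

# Each of these malformed inputs must be rejected by a strict decoder
bad_inputs = [
    b"\xc1\x81",          # overlong encoding of 'A'
    b"\x80",              # isolated continuation byte
    b"\xf5\x80\x80\x80",  # lead byte for a value beyond U+10FFFF
    b"\xed\xa0\x80",      # surrogate code point U+D800
    b"\xe4\xbd",          # truncated multi-byte sequence
]
for data in bad_inputs:
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        print(f"{data.hex(' ')}: rejected ({exc.reason})")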

Programming Language Support

Python

# Encoding
s = "你好世界"
b = s.encode('utf-8') # bytes object

# Decoding
s = b.decode('utf-8') # str object

# File operations
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

JavaScript

// Encoding
const str = "你好世界";
const encoder = new TextEncoder();
const bytes = encoder.encode(str); // Uint8Array

// Decoding
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes); // string

Java

// Encoding
String str = "你好世界";
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

// Decoding
String decoded = new String(bytes, StandardCharsets.UTF_8);

// File operations
Files.readString(path, StandardCharsets.UTF_8);

Go

// Go's string is natively UTF-8
s := "你好世界"

// Convert to byte slice
b := []byte(s)

// Convert from byte slice
s = string(b)

Performance Characteristics

Space Efficiency Comparison

Text Type   UTF-8     UTF-16    UTF-32
────────────────────────────────────────
English     1 byte    2 bytes   4 bytes
Chinese     3 bytes   2 bytes   4 bytes
Emoji       4 bytes   4 bytes   4 bytes

English-dominant text: UTF-8 optimal
CJK-dominant text:     UTF-16 slightly better
Mixed text:            UTF-8 usually optimal
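
These figures are easy to reproduce by encoding sample strings yourself; the snippet below uses the little-endian UTF-16/UTF-32 variants so that the byte counts do not include a BOM. The sample strings are illustrative.

# Compare encoded sizes of the same text under UTF-8, UTF-16, and UTF-32
samples = {"English": "Hello, World!", "Chinese": "你好世界", "Mixed": "Hello 世界 😀"}
for label, text in samples.items():
    sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(label, sizes)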


Quick Diagnostic Tools

Identify UTF-8 Encoding

def is_utf8(data):
    """Detect if data is valid UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
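
A couple of quick checks using the function above:

print(is_utf8("你好".encode('utf-8')))   # True
print(is_utf8(b"\xc1\x81"))              # False (overlong/invalid sequence)
print(is_utf8(b"\xff\xfe"))              # False (UTF-16 BOM, not valid UTF-8)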

Fix Encoding Issues

# Common problem: UTF-8 bytes mistakenly decoded as Latin-1 (mojibake)
# Original: "你好"
# Wrong display: "ä½ å¥½"

# Fix: re-encode the garbled text as Latin-1, then decode the bytes as UTF-8
text = "ä½ å¥½"
fixed = text.encode('latin1').decode('utf-8')
# Result: "你好"

Important Note: UTF-8 is the default standard for the modern Internet. Always use UTF-8 encoding and avoid legacy encodings such as GBK, ISO-8859-1, Windows-1252, etc. All new projects should use UTF-8 as the sole character encoding.