2. Characters
This chapter defines the character sets, encoding mechanisms, and processing rules used in URIs.
Character Encoding Fundamentals
The URI syntax provides a method of encoding data into a sequence of characters, likely for the purpose of identifying a resource.
Encoding Hierarchy:
Resource → URI Characters → Octets → Transmission/Storage
Character Set: URIs are based on the US-ASCII character set, consisting of digits, letters, and a few graphic symbols
2.1. Percent-Encoding
Purpose
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter.
Encoding Format
pct-encoded = "%" HEXDIG HEXDIG
Format: A percent-encoding consists of a percent character "%" followed by the two hexadecimal digits representing that octet's numeric value
Examples
| Character | Binary | Hexadecimal | Percent-Encoded |
|---|---|---|---|
| Space | 00100000 | 0x20 | %20 |
| ! | 00100001 | 0x21 | %21 |
| # | 00100011 | 0x23 | %23 |
| 中 (U+4E2D, UTF-8) | 11100100 10111000 10101101 | 0xE4 0xB8 0xAD | %E4%B8%AD |
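The rows above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `pct_encode_char` is hypothetical, and UTF-8 is assumed for the non-ASCII character.

```python
def pct_encode_char(char: str) -> str:
    """Percent-encode every UTF-8 octet of a single character."""
    # Each octet becomes "%" followed by two uppercase hexadecimal digits.
    return "".join(f"%{octet:02X}" for octet in char.encode("utf-8"))


for char in [" ", "!", "#", "中"]:
    print(repr(char), pct_encode_char(char))
```

A multi-octet character such as 中 yields one percent-encoded triplet per UTF-8 octet.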
Case Rules
Equivalence: The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f'
Normalization: Two URIs that differ only in the case of hexadecimal digits used in percent-encoded octets are equivalent
Recommendation: URI producers and normalizers SHOULD use uppercase hexadecimal digits for all percent-encodings
Recommended: %2F %3A %5B
Not recommended: %2f %3a %5b
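One way a normalizer might apply this recommendation is sketched below. The function name `uppercase_pct_encodings` is hypothetical; the sketch touches only well-formed triplets and leaves every other character alone.

```python
import re

# Matches one percent-encoded octet: "%" followed by two hex digits.
_PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_pct_encodings(uri: str) -> str:
    """Uppercase the hex digits of every percent-encoded octet in uri."""
    return _PCT_ENCODED.sub(lambda match: match.group(0).upper(), uri)


print(uppercase_pct_encodings("/a%2fb%3a%5b"))
```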
2.2. Reserved Characters
Definition
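URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by a scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.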
Reserved Character Set
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Classification
General Delimiters (gen-delims)
| Character | Purpose | Example |
|---|---|---|
| : | Separates the scheme from the rest of the URI (also host from port) | http: |
| / | Path separator | /path/to/resource |
| ? | Query separator | ?key=value |
| # | Fragment separator | #section |
| [ ] | IPv6 address boundaries | [2001:db8::1] |
| @ | User information separator | user@host |
Sub-Delimiters (sub-delims)
| Character | Common Use |
|---|---|
| ! $ ' ( ) * | Subcomponent separation in path or query |
| + | Sub-delimiter; represents a space only in application/x-www-form-urlencoded query strings |
| , | List separator |
| ; | Parameter separator |
| = | Key-value separator |
| & | Query parameter separator |
Encoding Rules
Conflict Handling: If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data MUST be percent-encoded before the URI is formed
Examples:
Path containing "?" character:
Original: /path/file?.txt
Encoded: /path/file%3F.txt
Query value containing "&" as data:
Data: name = "Tom&Jerry"
Encoded: ?name=Tom%26Jerry ("&" is data, so it is percent-encoded)
Compare: ?name=Tom&name=Jerry ("&" is a delimiter separating two parameters)
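In Python, the standard library's `urllib.parse.quote` can apply this rule when each piece of data is encoded on its own, before the delimiters are added. A minimal sketch, assuming `safe=""` so that every reserved character in the data is encoded:

```python
from urllib.parse import quote

# Encode each piece of data separately, then join with the real delimiters.
segment = quote("file?.txt", safe="")
path = "/path/" + segment

value = quote("Tom&Jerry", safe="")
query = "?name=" + value

print(path)
print(query)
```

Encoding the assembled URI in one pass would not work, because the encoder could no longer tell delimiters from data.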
Equivalence
Important: URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are NOT equivalent
http://example.com/path?key=value
http://example.com/path%3Fkey=value
These two URIs are NOT equivalent
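The difference becomes visible once both URIs are parsed. The sketch below uses `urllib.parse.urlsplit` from the Python standard library purely as an illustration:

```python
from urllib.parse import urlsplit

a = urlsplit("http://example.com/path?key=value")
b = urlsplit("http://example.com/path%3Fkey=value")

# In the first URI "?" is a delimiter, so a query component exists.
print(repr(a.path), repr(a.query))
# In the second URI "%3F" is data inside the path, and the query is empty.
print(repr(b.path), repr(b.query))
```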
2.3. Unreserved Characters
Definition
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved.
Unreserved Character Set
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Includes:
- ALPHA: Uppercase and lowercase letters (A-Z, a-z)
- DIGIT: Decimal digits (0-9)
- -: Hyphen
- .: Period
- _: Underscore
- ~: Tilde
Encoding Rules
Equivalence: URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent
Normalization: Percent-encoded octets corresponding to unreserved characters SHOULD be decoded
Equivalent URIs:
http://example.com/~user
http://example.com/%7Euser
Normalized to:
http://example.com/~user
Percent-Encoding Range
SHOULD NOT be created:
- ALPHA: %41-%5A (A-Z), %61-%7A (a-z)
- DIGIT: %30-%39 (0-9)
- Hyphen: %2D
- Period: %2E
- Underscore: %5F
- Tilde: %7E
SHOULD be decoded: When these encodings are found in a URI, normalizers SHOULD decode them to their corresponding unreserved characters
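A normalizer following this rule could look roughly like the sketch below. The name `decode_unreserved` is hypothetical. Triplets that do not correspond to an unreserved character stay encoded, and their hex digits are uppercased as a convenience.

```python
import re

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode only those percent-encoded octets that map to unreserved characters."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        # Reserved and non-ASCII octets must stay encoded.
        return char if char in UNRESERVED else match.group(0).upper()

    return _PCT_ENCODED.sub(replace, uri)


print(decode_unreserved("http://example.com/%7Euser"))
```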
2.4. When to Encode or Decode
When to Encode
URI Producers:
- When producing the URI, MUST percent-encode characters that are not allowed
- Reserved characters used as data SHOULD be percent-encoded unless the scheme explicitly allows them as data in that component
- Unreserved characters SHOULD NOT be encoded
Examples:
# Encoding path
path = "/files/my document.pdf"
encoded = "/files/my%20document.pdf"
# Encoding query
query = "?name=John Doe&age=30"
encoded = "?name=John%20Doe&age=30"
When to Decode
URI Consumers:
- After parsing the URI, decode components as needed
- Do not decode prematurely (may change URI structure)
- Decode each component only once
Dangerous Example:
Original: /path%2Fto%2Ffile
Premature decode: /path/to/file (changed path structure!)
Correct: Parse first, then decode each segment
Segment 1: "path%2Fto%2Ffile" → Decode → "path/to/file"
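The order of operations can be sketched in Python as follows. The URI and the function names are made up for illustration; `urlsplit` and `unquote` are from the standard library.

```python
from urllib.parse import unquote, urlsplit


def decode_then_split(uri: str) -> list:
    """Wrong order: decoding first turns %2F into a real path separator."""
    return unquote(urlsplit(uri).path).split("/")[1:]


def split_then_decode(uri: str) -> list:
    """Right order: split on the literal "/" delimiters, then decode each segment."""
    return [unquote(segment) for segment in urlsplit(uri).path.split("/")[1:]]


uri = "http://example.com/path%2Fto%2Ffile/readme"
print(decode_then_split(uri))
print(split_then_decode(uri))
```

The first function reports four segments, the second reports the two segments the URI actually contains.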
Double-Encoding Problem
Original data: "100%"
First encoding: "100%25"
Wrong second encoding: "100%2525"
When decoding, a single pass no longer recovers the original data; two passes are needed:
"100%2525" → "100%25" → "100%"
2.5. Identifying Data
Character Sets and Encoding
Character vs Octet:
- URIs are sequences of characters
- Characters are encoded as octets for transmission/storage
- UTF-8 is the recommended character encoding
Internationalized Resource Identifiers (IRI)
IRI Extension: RFC 3987 defines IRIs, which allow the use of Unicode characters
Conversion:
IRI: http://例え.jp/引き出し
↓ Host: convert to Punycode (IDNA); path: encode to UTF-8 and percent-encode
URI: http://xn--r8jz45g.jp/%E5%BC%95%E3%81%8D%E5%87%BA%E3%81%97
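The conversion can be approximated with the Python standard library. This is a rough sketch under stated assumptions, not a complete RFC 3987 implementation: the function name `iri_to_uri` is hypothetical, Python's built-in `idna` codec implements the older IDNA 2003 rules, port and user information are dropped, the query and fragment are passed through unchanged, and the path is assumed to contain no existing percent-encodings.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Convert a simple IRI (scheme, host, path) to a URI."""
    parts = urlsplit(iri)
    # Host labels are converted to Punycode rather than percent-encoded.
    host = parts.hostname.encode("idna").decode("ascii")
    # Path characters are UTF-8 encoded and percent-encoded; "/" stays a delimiter.
    path = quote(parts.path, safe="/")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(iri_to_uri("http://例え.jp/引き出し"))
```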
Best Practices
URI Production:
- Encode non-ASCII characters using UTF-8
- Percent-encode the resulting octets
- Use uppercase hexadecimal digits
- Do not encode unreserved characters
URI Consumption:
- Parse by component
- Decode percent-encodings
- Interpret octets using UTF-8
- Handle invalid encodings (malformed percent-encoded triplets, octets that are not valid UTF-8)
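On the consuming side, one possible way to detect octets that are not valid UTF-8 is to decode strictly and treat the failure as a signal. The sketch below relies on the `errors` parameter of `urllib.parse.unquote`. The wrapper name `decode_component` and the choice to return `None` are assumptions made for illustration.

```python
from urllib.parse import unquote


def decode_component(component: str):
    """Decode one already-parsed component; return None if it is not valid UTF-8."""
    try:
        return unquote(component, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


print(decode_component("%E4%B8%AD"))
print(decode_component("%FF"))
```

Whether to reject, replace, or pass through invalid input is an application decision; this sketch only makes the failure visible.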
Character Set Quick Reference
Complete Character Classification
URI Characters
├── Unreserved (unreserved)
│ ├── ALPHA: A-Z, a-z
│ ├── DIGIT: 0-9
│ └── Symbols: - . _ ~
│
├── Reserved (reserved)
│ ├── General Delimiters (gen-delims): : / ? # [ ] @
│ └── Sub-Delimiters (sub-delims): ! $ & ' ( ) * + , ; =
│
└── Percent-Encoded (pct-encoded): "%" HEXDIG HEXDIG
Encoding Decision Tree
Character needs to appear in URI?
├─ Is unreserved character? → Use directly
├─ Is reserved character?
│ ├─ Used as delimiter? → Use directly
│ └─ Used as data? → Percent-encode
└─ Other character? → Percent-encode
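The decision tree maps directly onto a small function. In this sketch the name `encode_char` and the `is_delimiter` flag are hypothetical; the caller is assumed to know whether a reserved character is acting as a delimiter in the component being built.

```python
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
RESERVED = frozenset(":/?#[]@!$&'()*+,;=")


def encode_char(char: str, is_delimiter: bool = False) -> str:
    """Apply the decision tree to a single character."""
    if char in UNRESERVED:
        return char  # unreserved: use directly
    if char in RESERVED and is_delimiter:
        return char  # reserved, acting as a delimiter: use directly
    # Reserved-as-data and all other characters: percent-encode the UTF-8 octets.
    return "".join(f"%{octet:02X}" for octet in char.encode("utf-8"))


print(encode_char("/", is_delimiter=True), encode_char("/"), encode_char(" "))
```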
Common Character Encoding Table
| Character | Purpose | Encoding |
|---|---|---|
| Space | Separation | %20 (or + in application/x-www-form-urlencoded query strings) |
| ! | Sub-delimiter | %21 (if encoding needed) |
| " | Quote | %22 |
| # | Fragment delimiter | %23 (in data) |
| $ | Sub-delimiter | %24 (if encoding needed) |
| % | Encoding marker | %25 |
| & | Parameter separator | %26 (in data) |
| ' | Sub-delimiter | %27 (if encoding needed) |
| ( ) | Sub-delimiter | %28 %29 |
| + | Space/Sub-delimiter | %2B (in data) |
| , | List separator | %2C (if encoding needed) |
| / | Path separator | %2F (in data) |
| : | Scheme separator | %3A (in data) |
| ; | Parameter separator | %3B (if encoding needed) |
| = | Key-value separator | %3D (if encoding needed) |
| ? | Query separator | %3F (in data) |
| @ | User info separator | %40 (in data) |
| [ ] | IPv6 boundaries | %5B %5D |
Implementation Recommendations
Encoding Implementation
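import string

# ASCII-only unreserved set; str.isalnum() would also accept non-ASCII letters and digits
UNRESERVED = frozenset(string.ascii_letters + string.digits + '-._~')

def percent_encode(text, safe=''):
    """Percent-encode text, leaving unreserved characters and those in safe as-is."""
    result = []
    for char in text:
        if char in safe or is_unreserved(char):
            result.append(char)
        else:
            # UTF-8 encode the character, then percent-encode each octet
            for byte in char.encode('utf-8'):
                result.append(f'%{byte:02X}')
    return ''.join(result)

def is_unreserved(char):
    """Check if character is unreserved (ASCII letters, digits, - . _ ~)."""
    return char in UNRESERVED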
Decoding Implementation
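HEX_DIGITS = frozenset('0123456789ABCDEFabcdef')

def percent_decode(text):
    """Percent-decode text and interpret the resulting octets as UTF-8."""
    result = bytearray()
    i = 0
    while i < len(text):
        # A valid triplet is "%" followed by exactly two hex digits;
        # int() alone would also accept inputs such as "+A" or " A"
        if (text[i] == '%' and i + 2 < len(text)
                and text[i+1] in HEX_DIGITS and text[i+2] in HEX_DIGITS):
            result.append(int(text[i+1:i+3], 16))
            i += 3
        else:
            # Copy literal characters and malformed "%" sequences unchanged
            result.extend(text[i].encode('utf-8'))
            i += 1
    # Invalid UTF-8 octets become U+FFFD; use errors='strict' to reject them instead
    return result.decode('utf-8', errors='replace')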