2. Characters

This chapter defines the character sets, encoding mechanisms, and processing rules used in URIs.

Character Encoding Fundamentals

The URI syntax provides a method of encoding data, presumably for the purpose of identifying a resource, as a sequence of characters.

Encoding Hierarchy:

Resource → URI Characters → Octets → Transmission/Storage

Character Set: URIs are composed from a limited subset of the US-ASCII character set: digits, letters, and a few graphic symbols.

2.1. Percent-Encoding

Purpose

A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter.

Encoding Format

pct-encoded = "%" HEXDIG HEXDIG

Format: A percent-encoding consists of a percent character "%" followed by the two hexadecimal digits representing that octet's numeric value

Examples

Character	Binary	Hexadecimal	Percent-Encoded
Space	00100000	0x20	`%20`
!	00100001	0x21	`%21`
#	00100011	0x23	`%23`
中 (U+4E2D, UTF-8)	11100100 10111000 10101101	0xE4 0xB8 0xAD	`%E4%B8%AD`
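
The rows of the table can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `pct_encode_octets` is hypothetical, and it assumes non-ASCII characters are first converted to UTF-8 octets.

```python
def pct_encode_octets(data: bytes) -> str:
    """Render every octet as "%" followed by two uppercase hex digits."""
    return "".join(f"%{octet:02X}" for octet in data)


# Each character is first turned into octets (UTF-8 here), then encoded.
for char in [" ", "!", "#", "中"]:
    print(repr(char), pct_encode_octets(char.encode("utf-8")))
```

Note that the function encodes every octet it is given; deciding which characters actually need encoding is a separate step.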

Case Rules

Equivalence: The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f'

Normalization: Two URIs that differ only in the case of hexadecimal digits used in percent-encoded octets are equivalent

Recommendation: URI producers and normalizers SHOULD use uppercase hexadecimal digits for all percent-encodings

Recommended: %2F %3A %5B
Not recommended: %2f %3a %5b
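
One way a normalizer might apply the uppercase recommendation is a regular-expression substitution that touches only well-formed percent-encoded triplets. The function name `uppercase_pct_hex` is an assumption made for this sketch.

```python
import re

# Matches "%" followed by exactly two hexadecimal digits.
_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_pct_hex(uri: str) -> str:
    """Uppercase the hex digits of each percent-encoded octet, nothing else."""
    return _PCT_TRIPLET.sub(lambda m: m.group(0).upper(), uri)


print(uppercase_pct_hex("/a%2fb%3a%5b"))
```

Letters outside percent-encoded triplets are left alone, so the rest of the URI keeps its original case.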

2.2. Reserved Characters

Definition

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by a scheme-specific syntax, or by an implementation.

Reserved Character Set

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

Classification

General Delimiters (gen-delims)

Character	Purpose	Example
:	Terminates the scheme; also separates host from port	`http:`
/	Path separator	`/path/to/resource`
?	Query separator	`?key=value`
#	Fragment separator	`#section`
[ ]	IPv6 address boundaries	`[2001:db8::1]`
@	User information separator	`user@host`

Sub-Delimiters (sub-delims)

Character	Common Use
! $ ' ( ) *	Subcomponent separation in path or query
+	Space in `application/x-www-form-urlencoded` query data (a form convention, not part of the generic URI syntax)
,	List separator
;	Parameter separator
=	Key-value separator
&	Query parameter separator
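
The two reserved subsets map directly onto set constants. The following is a minimal sketch; the names `GEN_DELIMS`, `SUB_DELIMS`, and `classify_reserved` are illustrative and not part of any specification or library.

```python
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS


def classify_reserved(char: str) -> str:
    """Return "gen-delims", "sub-delims", or "not reserved" for one character."""
    if char in GEN_DELIMS:
        return "gen-delims"
    if char in SUB_DELIMS:
        return "sub-delims"
    return "not reserved"


for char in "?&a":
    print(char, classify_reserved(char))
```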

Encoding Rules

Conflict Handling: If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data MUST be percent-encoded before the URI is formed

Examples:

Path containing "?" character:
Original: /path/file?.txt
Encoded: /path/file%3F.txt

Query containing "&" character:
Original: ?name=Tom&Jerry
Encoded: ?name=Tom%26Jerry (when "&" is part of the single value "Tom&Jerry")
Or: ?name=Tom&name=Jerry (when "&" separates two values, "Tom" and "Jerry")
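
In code, the distinction comes down to encoding each key and value before joining them with the delimiters. A sketch using the standard library's `urllib.parse.quote`; the helper `build_query` is hypothetical and assumes the common `key=value&key=value` query convention.

```python
from urllib.parse import quote


def build_query(pairs):
    """Join key/value pairs with "&" and "=", encoding those characters in data."""
    # safe="" makes quote() encode every reserved character inside keys and values.
    return "?" + "&".join(
        f"{quote(key, safe='')}={quote(value, safe='')}" for key, value in pairs
    )


print(build_query([("name", "Tom&Jerry")]))               # one value containing "&"
print(build_query([("name", "Tom"), ("name", "Jerry")]))  # two separate values
```

Because the delimiters are added only after the data has been encoded, an "&" inside a value can never be mistaken for a parameter separator.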

Equivalence

Important: URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are NOT equivalent

http://example.com/path?key=value
http://example.com/path%3Fkey=value

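These two URIs are NOT equivalent: in the first, "?" delimits the query "key=value"; in the second, "%3F" is data inside the path and there is no query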
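
The difference is visible as soon as both URIs are parsed. A short check with `urllib.parse.urlsplit`, which splits on delimiters without decoding percent-encodings (the variable names are arbitrary):

```python
from urllib.parse import urlsplit

a = urlsplit("http://example.com/path?key=value")
b = urlsplit("http://example.com/path%3Fkey=value")

print(repr(a.path), repr(a.query))  # path and query are separate components
print(repr(b.path), repr(b.query))  # "%3F" stays inside the path; the query is empty
```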

2.3. Unreserved Characters

Definition

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved.

Unreserved Character Set

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

Includes:

ALPHA: Uppercase and lowercase letters (A-Z, a-z)
DIGIT: Decimal digits (0-9)
-: Hyphen
.: Period
_: Underscore
~: Tilde

Encoding Rules

Equivalence: URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent

Normalization: Percent-encoded octets corresponding to unreserved characters SHOULD be decoded

Equivalent URIs:
http://example.com/~user
http://example.com/%7Euser

Normalized to:
http://example.com/~user

Percent-Encoding Range

SHOULD NOT be created by URI producers:

ALPHA: %41-%5A (A-Z), %61-%7A (a-z)
DIGIT: %30-%39 (0-9)
Hyphen: %2D
Period: %2E
Underscore: %5F
Tilde: %7E

SHOULD be decoded: When these encodings are found in a URI, normalizers SHOULD decode them to their corresponding unreserved characters
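
A normalizer could implement this rule by decoding only those triplets whose octet falls in the unreserved set and leaving every other percent-encoding untouched. This is a sketch under that assumption; `decode_unreserved` is a hypothetical name.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode only the percent-encodings that stand for unreserved characters."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Reserved and non-ASCII octets must stay encoded.
        return char if char in UNRESERVED else match.group(0)
    return _PCT_TRIPLET.sub(replace, uri)


print(decode_unreserved("http://example.com/%7Euser"))
```

Encodings of reserved characters such as `%2F` are deliberately left alone, since decoding them would change the meaning of the URI.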

2.4. When to Encode or Decode

When to Encode

URI Producers:

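When producing the URI, MUST percent-encode characters that are not allowed in the component
Reserved characters are left unencoded when used as delimiters and MUST be percent-encoded when they are data that would conflict with a delimiter
Unreserved characters SHOULD NOT be encoded
A literal "%" used as data MUST be encoded as %25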

Examples:

# Encoding path
path = "/files/my document.pdf"
encoded = "/files/my%20document.pdf"

# Encoding query
query = "?name=John Doe&age=30"
encoded = "?name=John%20Doe&age=30"
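
For paths, a producer can encode each segment separately and add the "/" delimiters afterwards. The sketch below uses `urllib.parse.quote`; the helper `encode_path` and its list-of-segments input are assumptions made for illustration.

```python
from urllib.parse import quote


def encode_path(segments):
    """Percent-encode each path segment, then join the segments with "/"."""
    # safe="" ensures a "/" inside a segment is encoded as data (%2F).
    return "/" + "/".join(quote(segment, safe="") for segment in segments)


print(encode_path(["files", "my document.pdf"]))
print(encode_path(["reports", "2024/Q1.pdf"]))  # "/" inside a segment is data
```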

When to Decode

URI Consumers:

After parsing the URI, decode components as needed
Do not decode prematurely (may change URI structure)
Decode each component only once

Dangerous Example:

Original: /path%2Fto%2Ffile (one segment)
Premature decode: /path/to/file (now three segments: changed path structure!)

Correct: Parse first, then decode each segment
Segment 1: "path%2Fto%2Ffile" → Decode → "path/to/file" (still a single segment whose data contains "/")
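
The parse-then-decode order can be sketched with the standard library. `path_segments` is a hypothetical helper and assumes the URI has a path that starts with "/".

```python
from urllib.parse import unquote, urlsplit


def path_segments(uri: str):
    """Split the raw path on "/" first, then decode each segment exactly once."""
    raw_path = urlsplit(uri).path  # urlsplit() does not decode percent-encodings
    # Drop the empty string before the leading "/" (assumes an absolute path).
    return [unquote(segment) for segment in raw_path.split("/")[1:]]


print(path_segments("http://example.com/path%2Fto%2Ffile"))
print(path_segments("http://example.com/path/to/file"))
```

The first call yields one segment and the second yields three, which is exactly the distinction a premature decode would erase.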

Double-Encoding Problem

Original data: "100%"
First encoding: "100%25"
Wrong second encoding: "100%2525"

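When decoding (a consumer decodes only once):
"100%2525" → "100%25" (wrong: recovering the original "100%" would take a second decode)

Rule: Implementations MUST NOT percent-encode or decode the same string more than once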

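Both directions of the problem can be checked with `urllib.parse.quote` and `unquote`. The second half of this sketch uses an illustrative value, the literal text "%41", to show that decoding twice can also corrupt data.

```python
from urllib.parse import quote, unquote

data = "100%"
once = quote(data, safe="")    # correct: encoded a single time
twice = quote(once, safe="")   # wrong: the "%" of "%25" is encoded again
print(once, twice, unquote(twice))

# Decoding twice is just as harmful: the literal text "%41" turns into "A".
literal = "%41"
encoded = quote(literal, safe="")
print(encoded, unquote(encoded), unquote(unquote(encoded)))
```
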
2.5. Identifying Data

Character Sets and Encoding

Character vs Octet:

URIs are sequences of characters
Characters are encoded as octets for transmission/storage
UTF-8 is the recommended character encoding for textual data in new URI schemes

Internationalized Resource Identifiers (IRI)

IRI Extension: RFC 3987 defines IRIs, which allow the use of Unicode characters

Conversion:

IRI: http://例え.jp/引き出し
       ↓ Host: convert with IDNA (Punycode); path: encode as UTF-8, then percent-encode
URI: http://xn--r8jz45g.jp/%E5%BC%95%E3%81%8D%E5%87%BA%E3%81%97
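
The conversion above treats the host and the path differently, which a small sketch makes explicit. It relies on Python's built-in `idna` codec (which implements the older IDNA 2003 rules) and `urllib.parse.quote`. The function `iri_to_uri` is hypothetical and simplified: it assumes an absolute IRI with a host, drops any port or user information, and passes the query and fragment through unchanged.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Convert the host with IDNA and percent-encode the path as UTF-8."""
    parts = urlsplit(iri)
    # Host labels are converted to their ASCII ("xn--") form, not percent-encoded.
    host = parts.hostname.encode("idna").decode("ascii")
    # "%" is kept safe so existing percent-encodings are not encoded twice.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(iri_to_uri("http://例え.jp/引き出し"))
```

A production converter would follow RFC 3987 in full, including the query, fragment, and user information components.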

Best Practices

URI Production:

Encode non-ASCII characters using UTF-8
Percent-encode the resulting octets
Use uppercase hexadecimal digits
Do not encode unreserved characters

URI Consumption:

Parse by component
Decode percent-encodings
Interpret octets using UTF-8
Handle invalid encodings (malformed "%" sequences, octets that are not valid UTF-8)
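
One possible way to handle invalid encodings is to reject them outright rather than guess. The sketch below is an assumption about policy, not a requirement: `decode_component_strict` is a hypothetical helper that raises `ValueError` for a malformed "%" sequence or for octets that are not valid UTF-8.

```python
import re
from urllib.parse import unquote_to_bytes

# A "%" that is not followed by exactly two hexadecimal digits.
_BAD_PCT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def decode_component_strict(component: str) -> str:
    """Decode one already-parsed component, raising ValueError on bad input."""
    if _BAD_PCT.search(component):
        raise ValueError(f"malformed percent-encoding in {component!r}")
    # UnicodeDecodeError is a subclass of ValueError, so callers catch one type.
    return unquote_to_bytes(component).decode("utf-8")


print(decode_component_strict("%E4%B8%AD"))
```

A more lenient consumer might instead pass malformed sequences through unchanged; which policy fits depends on the application.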

Character Set Quick Reference

Complete Character Classification

URI Characters
├── Unreserved (unreserved)
│   ├── ALPHA: A-Z, a-z
│   ├── DIGIT: 0-9
│   └── Symbols: - . _ ~
│
├── Reserved (reserved)
│   ├── General Delimiters (gen-delims): : / ? # [ ] @
│   └── Sub-Delimiters (sub-delims): ! $ & ' ( ) * + , ; =
│
└── Percent-Encoded (pct-encoded): %HEXDIG HEXDIG

Encoding Decision Tree

Character needs to appear in URI?
  ├─ Is unreserved character? → Use directly
  ├─ Is reserved character?
  │   ├─ Used as delimiter? → Use directly
  │   └─ Used as data? → Percent-encode
  └─ Other character? → Percent-encode
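
The decision tree translates almost line for line into a function. In this sketch the caller states whether a reserved character is acting as a delimiter; the name `encode_char` and the `is_delimiter` flag are assumptions made for illustration.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
RESERVED = set(":/?#[]@!$&'()*+,;=")


def encode_char(char: str, is_delimiter: bool = False) -> str:
    """Apply the decision tree to a single character."""
    if char in UNRESERVED:
        return char                      # unreserved: use directly
    if char in RESERVED and is_delimiter:
        return char                      # reserved, used as a delimiter
    # Reserved-as-data or any other character: UTF-8 octets, percent-encoded.
    return "".join(f"%{octet:02X}" for octet in char.encode("utf-8"))


print(encode_char("/", is_delimiter=True), encode_char("/"), encode_char(" "))
```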

Common Character Encoding Table

Character	Purpose	Encoding
Space	Separation	`%20` (`+` only in `application/x-www-form-urlencoded` query data)
!	Sub-delimiter	`%21` (if encoding needed)
"	Quote	`%22`
#	Fragment delimiter	`%23` (in data)
$	Sub-delimiter	`%24` (if encoding needed)
%	Encoding marker	`%25`
&	Parameter separator	`%26` (in data)
'	Sub-delimiter	`%27` (if encoding needed)
( )	Sub-delimiter	`%28` `%29`
+	Space/Sub-delimiter	`%2B` (in data)
,	List separator	`%2C` (if encoding needed)
/	Path separator	`%2F` (in data)
:	Scheme separator	`%3A` (in data)
;	Parameter separator	`%3B` (if encoding needed)
=	Key-value separator	`%3D` (if encoding needed)
?	Query separator	`%3F` (in data)
@	User info separator	`%40` (in data)
[ ]	IPv6 boundaries	`%5B` `%5D`

Implementation Recommendations

Encoding Implementation

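import string

# Unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~" (ASCII only)
UNRESERVED = frozenset(string.ascii_letters + string.digits + '-._~')

def percent_encode(text, safe=''):
    """Percent-encode text, leaving unreserved and `safe` characters as-is"""
    result = []
    for char in text:
        if char in safe or is_unreserved(char):
            result.append(char)
        else:
            # Encode the character as UTF-8, then percent-encode each octet
            for byte in char.encode('utf-8'):
                result.append(f'%{byte:02X}')
    return ''.join(result)

def is_unreserved(char):
    """Check if character is unreserved"""
    # str.isalnum() would also accept non-ASCII letters such as "中"
    return char in UNRESERVED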

Decoding Implementation

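import string

def percent_decode(text):
    """Percent-decode text, passing malformed "%" sequences through unchanged"""
    result = bytearray()
    i = 0
    while i < len(text):
        hex_pair = text[i+1:i+3]
        # A valid triplet is "%" plus exactly two hex digits; int() alone
        # would also accept strings such as "+1" or " 1"
        if (text[i] == '%' and len(hex_pair) == 2
                and all(c in string.hexdigits for c in hex_pair)):
            result.append(int(hex_pair, 16))
            i += 3
        else:
            result.extend(text[i].encode('utf-8'))
            i += 1
    # Octets that are not valid UTF-8 become U+FFFD instead of raising
    return result.decode('utf-8', errors='replace')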

Next Chapter: 3. Syntax Components - Structural components of URIs
