
2. Characters

This chapter defines the character sets, encoding mechanisms, and processing rules used in URIs.


Character Encoding Fundamentals

The URI syntax provides a method of encoding data into a sequence of characters, presumably for the purpose of identifying a resource.

Encoding Hierarchy:

Resource → URI Characters → Octets → Transmission/Storage

Character Set: URIs are based on the US-ASCII character set, consisting of digits, letters, and a few graphic symbols


2.1. Percent-Encoding

Purpose

A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter.

Encoding Format

pct-encoded = "%" HEXDIG HEXDIG

Format: A percent-encoding consists of a percent character "%" followed by the two hexadecimal digits representing that octet's numeric value

Examples

Character      Binary     Hexadecimal   Percent-Encoded
Space          00100000   0x20          %20
!              00100001   0x21          %21
#              00100011   0x23          %23
中 (Chinese)   -          0xE4B8AD      %E4%B8%AD
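
The format above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the helper names `pct_encode_octet` and `pct_encode_char` are assumptions made for this example.

```python
def pct_encode_octet(octet: int) -> str:
    """Format one octet (0-255) as "%" followed by two uppercase hex digits."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0-255")
    return f"%{octet:02X}"


def pct_encode_char(char: str) -> str:
    """Percent-encode every UTF-8 octet of a single character."""
    return "".join(pct_encode_octet(b) for b in char.encode("utf-8"))


print(pct_encode_char(" "))   # %20
print(pct_encode_char("中"))  # %E4%B8%AD
```

A multi-octet character produces one percent-encoded triplet per UTF-8 octet, which is why the last table row has three triplets.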

Case Rules

Equivalence: The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f'

Normalization: Two URIs that differ only in the case of hexadecimal digits used in percent-encoded octets are equivalent

Recommendation: URI producers and normalizers SHOULD use uppercase hexadecimal digits for all percent-encodings

Recommended: %2F %3A %5B
Not recommended: %2f %3a %5b
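
One possible way to apply the uppercase recommendation is a regular-expression pass over the URI. The sketch below assumes the input is already a syntactically valid URI; the function name `uppercase_pct_encodings` is hypothetical.

```python
import re

# Matches "%" followed by exactly two hexadecimal digits.
_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_pct_encodings(uri: str) -> str:
    """Uppercase the hex digits of each percent-encoding, leaving all else as-is."""
    return _PCT_TRIPLET.sub(lambda m: m.group(0).upper(), uri)


print(uppercase_pct_encodings("http://example.com/a%2fb%3a"))
# http://example.com/a%2Fb%3A
```

Only the triplets are touched, so case-sensitive parts of the URI such as the path keep their original case.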

2.2. Reserved Characters

Definition

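URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by a scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.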

Reserved Character Set

reserved    = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="

Classification

General Delimiters (gen-delims)

Character   Purpose                                                Example
:           Terminates the scheme (also separates host from port)  http:
/           Path separator                                         /path/to/resource
?           Query separator                                        ?key=value
#           Fragment separator                                     #section
[ ]         IPv6 address boundaries                                [2001:db8::1]
@           User information separator                             user@host

Sub-Delimiters (sub-delims)

Character     Common Use
! $ ' ( ) *   Subcomponent separation in path or query
+             Space representation in form-encoded queries (application/x-www-form-urlencoded)
,             List separator
;             Parameter separator
=             Key-value separator
&             Query parameter separator

Encoding Rules

Conflict Handling: If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data MUST be percent-encoded before the URI is formed

Examples:

Path containing a "?" character as data:
Original: /path/file?.txt
Encoded: /path/file%3F.txt

Query value containing an "&" character:
Original: ?name=Tom&Jerry
If & is data (one value, "Tom&Jerry"): ?name=Tom%26Jerry
If & is a delimiter (two values): ?name=Tom&name=Jerry
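
As a hedged illustration of the rule above, Python's standard `urllib.parse.quote` percent-encodes reserved characters when they are not listed in its `safe` argument. The variable names are chosen for this example only.

```python
from urllib.parse import quote

# safe="" tells quote() to treat every reserved character as data.
segment = quote("file?.txt", safe="")
value = quote("Tom&Jerry", safe="")

# Delimiters are added afterwards, so they stay unencoded.
path = "/path/" + segment
query = "?name=" + value

print(path)   # /path/file%3F.txt
print(query)  # ?name=Tom%26Jerry
```

Encoding each piece of data before joining it with delimiters keeps the two roles of a reserved character from being confused.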

Equivalence

Important: URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are NOT equivalent

http://example.com/path?key=value
http://example.com/path%3Fkey=value

These two URIs are NOT equivalent
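
The difference shows up as soon as the two URIs are parsed. The following sketch uses Python's standard `urllib.parse.urlsplit`; the variable names `a` and `b` are arbitrary.

```python
from urllib.parse import urlsplit

a = urlsplit("http://example.com/path?key=value")
b = urlsplit("http://example.com/path%3Fkey=value")

# In the first URI "?" is a delimiter, so a query component exists.
print(a.path, a.query)  # /path key=value

# In the second URI "%3F" is data inside the path, and the query is empty.
print(repr(b.path), repr(b.query))  # '/path%3Fkey=value' ''
```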

2.3. Unreserved Characters

Definition

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved.

Unreserved Character Set

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

Includes:

  • ALPHA: Uppercase and lowercase letters (A-Z, a-z)
  • DIGIT: Decimal digits (0-9)
  • -: Hyphen
  • .: Period
  • _: Underscore
  • ~: Tilde

Encoding Rules

Equivalence: URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent

Normalization: Percent-encoded octets corresponding to unreserved characters SHOULD be decoded

Equivalent URIs:
http://example.com/~user
http://example.com/%7Euser

Normalized to:
http://example.com/~user

Percent-Encoding Range

URI producers SHOULD NOT create percent-encodings in these ranges:

  • ALPHA: %41-%5A (A-Z), %61-%7A (a-z)
  • DIGIT: %30-%39 (0-9)
  • Hyphen: %2D
  • Period: %2E
  • Underscore: %5F
  • Tilde: %7E

SHOULD be decoded: When these encodings are found in a URI, normalizers SHOULD decode them to their corresponding unreserved characters
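
A normalizer following this rule decodes only the triplets that map to unreserved characters and leaves every other percent-encoding alone. The sketch below is one way to do that under the assumption that the input is a valid URI; `decode_unreserved` is a hypothetical helper name.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode percent-encodings of unreserved characters; keep all others."""
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Reserved and non-ASCII octets stay encoded to preserve the structure.
        return char if char in UNRESERVED else match.group(0)

    return _PCT_TRIPLET.sub(replace, uri)


print(decode_unreserved("http://example.com/%7Euser"))  # http://example.com/~user
print(decode_unreserved("/a%2Fb%41"))                   # /a%2FbA
```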


2.4. When to Encode or Decode

When to Encode

URI Producers:

  1. When producing the URI, MUST percent-encode characters that are not allowed
  2. Reserved characters are left unencoded only when they act as delimiters or when the URI scheme specifically allows them to represent data in that component
  3. Unreserved characters SHOULD NOT be encoded

Examples:

# Encoding path
path = "/files/my document.pdf"
encoded = "/files/my%20document.pdf"

# Encoding query
query = "?name=John Doe&age=30"
encoded = "?name=John%20Doe&age=30"
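
The same results can be produced with Python's standard library. This is a sketch rather than the only approach: `quote` keeps "/" unencoded by default, and passing `quote_via=quote` to `urlencode` yields `%20` for spaces instead of the `+` used by form encoding.

```python
from urllib.parse import quote, urlencode

# Encode the path; "/" is in quote()'s default safe set, so it stays a delimiter.
path = quote("/files/my document.pdf")

# Encode each key and value separately, then join them with "=" and "&".
query = urlencode({"name": "John Doe", "age": 30}, quote_via=quote)

uri_ref = f"{path}?{query}"
print(uri_ref)  # /files/my%20document.pdf?name=John%20Doe&age=30
```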

When to Decode

URI Consumers:

  1. After parsing the URI, decode components as needed
  2. Do not decode prematurely (may change URI structure)
  3. Decode each component only once

Dangerous Example:

Original: /path%2Fto%2Ffile
Premature decode: /path/to/file (changed path structure!)

Correct: Parse first, then decode each segment
Segment 1: "path%2Fto%2Ffile" → Decode → "path/to/file"
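
The parse-then-decode order can be sketched as follows, using the standard `urllib.parse.unquote`. The function name `path_segments` is an assumption for illustration.

```python
from urllib.parse import unquote


def path_segments(raw_path: str) -> list[str]:
    """Split on the "/" delimiter first, then decode each segment exactly once."""
    return [unquote(segment) for segment in raw_path.split("/")]


raw = "/path%2Fto%2Ffile"

# Correct order: one segment whose data happens to contain "/" characters.
print(path_segments(raw))       # ['', 'path/to/file']

# Premature decode: the encoded "/" turns into a delimiter and creates new segments.
print(unquote(raw).split("/"))  # ['', 'path', 'to', 'file']
```

The leading empty string is the (empty) text before the first "/" of an absolute path.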

Double-Encoding Problem

Original data: "100%"
First encoding: "100%25"
Wrong second encoding: "100%2525"

When decoding (once, as required):
"100%2525" → "100%25" (a stray "%25" remains; the original "100%" is not recovered)

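The double-encoding problem is easy to reproduce with the standard `urllib.parse` functions. The snippet below is a small demonstration; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

data = "100%"
once = quote(data, safe="")   # correct: encode the raw data exactly once
twice = quote(once, safe="")  # bug: the "%" of "%25" is encoded again

print(once)            # 100%25
print(twice)           # 100%2525
print(unquote(twice))  # 100%25  (a single decode does not restore the data)
print(unquote(once))   # 100%
```

Tracking whether a string holds raw data or already-encoded text, for example by naming or by distinct types, is one way to avoid encoding the same value twice.
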
2.5. Identifying Data

Character Sets and Encoding

Character vs Octet:

  • URIs are sequences of characters
  • Characters are encoded as octets for transmission/storage
  • UTF-8 is the recommended character encoding

Internationalized Resource Identifiers (IRI)

IRI Extension: RFC 3987 defines IRIs, which allow the use of Unicode characters

Conversion:

IRI: http://例え.jp/引き出し
↓ Host: convert to Punycode (IDNA); path: encode to UTF-8 and percent-encode
URI: http://xn--r8jz45g.jp/%E5%BC%95%E3%81%8D%E5%87%BA%E3%81%97
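
The conversion can be approximated with Python's standard library. This is a simplified sketch under several assumptions: the IRI has no userinfo or port, and Python's built-in "idna" codec (which implements the older IDNA 2003 rules) is acceptable for the host. The function name `iri_to_uri` and the constant `_KEEP` are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Reserved characters and "%" are kept so delimiters and existing
# percent-encodings are not encoded a second time.
_KEEP = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to a URI (simplified: userinfo and port are dropped)."""
    parts = urlsplit(iri)
    # Host labels are converted to Punycode rather than percent-encoded.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    # Other components: UTF-8 encode, then percent-encode non-ASCII octets.
    path = quote(parts.path, safe=_KEEP)
    query = quote(parts.query, safe=_KEEP)
    fragment = quote(parts.fragment, safe=_KEEP)
    return urlunsplit((parts.scheme, host, path, query, fragment))


print(iri_to_uri("http://例え.jp/引き出し"))
# http://xn--r8jz45g.jp/%E5%BC%95%E3%81%8D%E5%87%BA%E3%81%97
```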

Best Practices

URI Production:

  1. Encode non-ASCII characters using UTF-8
  2. Percent-encode the resulting octets
  3. Use uppercase hexadecimal digits
  4. Do not encode unreserved characters

URI Consumption:

  1. Parse by component
  2. Decode percent-encodings
  3. Interpret octets using UTF-8
  4. Handle invalid encodings
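
The consumption steps above might look like the following for a single, already-parsed component. It is a sketch that assumes strict rejection is the desired policy for octets that are not valid UTF-8; `decode_component` is a hypothetical name, and returning `None` is just one possible way to signal an error.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_component(component: str) -> Optional[str]:
    """Decode percent-encodings, then interpret the octets as UTF-8.

    Returns None when the decoded octets are not valid UTF-8.
    """
    octets = unquote_to_bytes(component)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(decode_component("%E4%B8%AD"))  # 中
print(decode_component("%FF"))        # None
```

Note that `unquote_to_bytes` leaves malformed triplets such as "%zz" unchanged, so a stricter consumer may want to reject those separately.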

Character Set Quick Reference

Complete Character Classification

URI Characters
├── Unreserved (unreserved)
│ ├── ALPHA: A-Z, a-z
│ ├── DIGIT: 0-9
│ └── Symbols: - . _ ~

├── Reserved (reserved)
│ ├── General Delimiters (gen-delims): : / ? # [ ] @
│ └── Sub-Delimiters (sub-delims): ! $ & ' ( ) * + , ; =

└── Percent-Encoded (pct-encoded): "%" HEXDIG HEXDIG

Encoding Decision Tree

Character needs to appear in URI?
├─ Is unreserved character? → Use directly
├─ Is reserved character?
│ ├─ Used as delimiter? → Use directly
│ └─ Used as data? → Percent-encode
└─ Other character? → Percent-encode
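
The decision tree translates almost directly into code. In this sketch the caller states whether a reserved character is acting as a delimiter, since that depends on the component being built; the function name `emit` and its `is_delimiter` parameter are assumptions for illustration.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
RESERVED = set(":/?#[]@" + "!$&'()*+,;=")  # gen-delims + sub-delims


def emit(char: str, is_delimiter: bool = False) -> str:
    """Return the form in which a single character should appear in a URI."""
    if char in UNRESERVED:
        return char  # unreserved: use directly
    if char in RESERVED and is_delimiter:
        return char  # reserved, used as a delimiter: use directly
    # Reserved used as data, or any other character: percent-encode as UTF-8.
    return "".join(f"%{octet:02X}" for octet in char.encode("utf-8"))


print(emit("~"))                     # ~
print(emit("/", is_delimiter=True))  # /
print(emit("/"))                     # %2F
print(emit(" "))                     # %20
```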

Common Character Encoding Table

Character   Purpose               Encoding
Space       Separation            %20 (or + in form-encoded queries)
!           Sub-delimiter         %21 (if encoding needed)
"           Quote                 %22
#           Fragment delimiter    %23 (in data)
$           Sub-delimiter         %24 (if encoding needed)
%           Encoding marker       %25
&           Parameter separator   %26 (in data)
'           Sub-delimiter         %27 (if encoding needed)
( )         Sub-delimiter         %28 %29
+           Space/Sub-delimiter   %2B (in data)
,           List separator        %2C (if encoding needed)
/           Path separator        %2F (in data)
:           Scheme separator      %3A (in data)
;           Parameter separator   %3B (if encoding needed)
=           Key-value separator   %3D (if encoding needed)
?           Query separator       %3F (in data)
@           User info separator   %40 (in data)
[ ]         IPv6 boundaries       %5B %5D

Implementation Recommendations

Encoding Implementation

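import string

# ASCII-only unreserved set; str.isalnum() would also accept non-ASCII
# letters and digits, which must be percent-encoded in a URI.
UNRESERVED = frozenset(string.ascii_letters + string.digits + '-._~')

def percent_encode(text, safe=''):
    """Percent-encode text, leaving unreserved and `safe` characters as-is."""
    result = []
    for char in text:
        if char in safe or is_unreserved(char):
            result.append(char)
        else:
            # UTF-8 encode the character, then percent-encode each octet
            for byte in char.encode('utf-8'):
                result.append(f'%{byte:02X}')
    return ''.join(result)

def is_unreserved(char):
    """Check if character is unreserved (ALPHA / DIGIT / "-" / "." / "_" / "~")."""
    return char in UNRESERVED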

Decoding Implementation

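import string

def percent_decode(text):
    """Percent-decode text and interpret the resulting octets as UTF-8."""
    result = bytearray()
    i = 0
    while i < len(text):
        triplet = text[i:i + 3]
        # Accept only "%" followed by exactly two hex digits; int() alone
        # would also accept forms such as "+1" or " 1".
        if (len(triplet) == 3 and triplet[0] == '%'
                and all(c in string.hexdigits for c in triplet[1:])):
            result.append(int(triplet[1:], 16))
            i += 3
        else:
            # Copy literal characters and malformed "%" sequences unchanged
            result.extend(text[i].encode('utf-8'))
            i += 1
    # Octets that are not valid UTF-8 become U+FFFD instead of raising
    return result.decode('utf-8', errors='replace')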
