2. Lexical Analysis of Messages
2.1. General Description
At the most basic level, a message is a series of characters. A message that is conformant with this specification is comprised of characters with values in the range 1 through 127 and interpreted as US-ASCII characters. For convenience, this document sometimes refers to this range of characters as simply "US-ASCII characters".
Note: This specification specifies that messages are made up of characters in the US-ASCII range of 1 through 127. There are other documents, specifically the MIME document series (RFC 2045, RFC 2046, RFC 2047, RFC 2049, RFC 4288, RFC 4289), that extend this specification to allow for values outside that range. Discussion of those mechanisms is not within the scope of this specification.
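The character range above can be checked mechanically. The following is a minimal sketch, assuming the message is available as raw bytes; the function name `is_us_ascii_message` is hypothetical and not part of the specification.

```python
def is_us_ascii_message(raw: bytes) -> bool:
    """Return True if every octet has a value in the range 1 through 127."""
    # NUL (0) and any octet with the high bit set (128..255) fall outside
    # the range this specification allows.
    return all(1 <= octet <= 127 for octet in raw)


print(is_us_ascii_message(b"Subject: Hello\r\n\r\nHi\r\n"))
print(is_us_ascii_message(b"caf\xc3\xa9"))
```

A message that fails this check may still be valid under the MIME extensions mentioned in the note, which this sketch does not consider.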
Messages are divided into lines of characters. A line is a series of characters that is delimited with the two characters carriage-return and line-feed; that is, the carriage return (CR) character (ASCII value 13) followed immediately by the line feed (LF) character (ASCII value 10). (The carriage-return/line-feed pair is usually written in this document as "CRLF".)
A message consists of header fields (collectively called "the header section of the message") optionally followed by a body. The header section is a sequence of lines of characters with special syntax as defined in this specification. The body is simply a sequence of characters that follows the header section and is separated from the header section by an empty line (i.e., a line with nothing preceding the CRLF).
Note: Common parlance and earlier versions of this specification use the term "header" to refer to either the entire header section or to an individual header field. To avoid ambiguity, this document does not use the terms "header" or "headers" in isolation, but instead always uses "header field" to refer to an individual field and "header section" to refer to the entire collection.
Message Structure Diagram
Message
├── Header Section
│   ├── From: [email protected] CRLF
│   ├── To: [email protected] CRLF
│   ├── Subject: Hello CRLF
│   └── Date: ... CRLF
├── Empty Line
│   └── CRLF
└── Body
    ├── This is the message body. CRLF
    └── Second line. CRLF
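The structure in the diagram suggests a simple way to separate the two parts: the header section ends at the first empty line. The sketch below assumes the whole message is in memory as bytes; `split_message` is a hypothetical helper name, and the returned header lines are physical lines (folded fields are not yet unfolded).

```python
CRLF = b"\r\n"


def split_message(raw: bytes) -> tuple[list[bytes], bytes]:
    """Split a raw message into (header-section lines, body)."""
    # An empty header section: the message starts with the empty line itself.
    if raw.startswith(CRLF):
        return [], raw[len(CRLF):]
    # The first CRLF CRLF marks the end of the last header line followed by
    # the empty line; everything after it is the body.
    head, separator, body = raw.partition(CRLF + CRLF)
    if not separator and head.endswith(CRLF):
        # No empty line at all: the body is absent, so drop the final CRLF.
        head = head[: -len(CRLF)]
    return (head.split(CRLF) if head else []), body


lines, body = split_message(b"From: a\r\nSubject: Hello\r\n\r\nBody line\r\n")
print(lines)
print(body)
```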
2.1.1. Line Length Limits
There are two limits that this specification places on the number of characters in a line. Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF.
The 998 character limit exists because many implementations that send, receive, or store IMF messages simply cannot handle more than 998 characters on a line. Receiving implementations would do well to handle an arbitrarily large number of characters in a line for robustness' sake. However, so many implementations (in compliance with the transport requirements of RFC 5321) do not accept messages with more than 1000 characters per line, including the CR and LF, that it is important for implementations not to create such messages.
The more conservative 78 character recommendation is to accommodate the many implementations of user interfaces that display these messages which may truncate, or disastrously wrap, the display of more than 78 characters per line, even though such implementations are non-conformant to the intent of this specification (and that of RFC 5321 if they actually cause information to be lost). Again, even though this limitation is put on messages, it is incumbent upon implementations that display messages to handle an arbitrarily large number of characters in a line (certainly at least up to the 998 character limit) for the sake of robustness.
Line Length Limits Summary:
| Limit Type | Length (excluding CRLF) | Requirement | Reason |
|---|---|---|---|
| Hard Limit | 998 characters | MUST | Many implementations cannot handle longer lines |
| Recommended Limit | 78 characters | SHOULD | Accommodate display truncation in UIs |
Line Length Examples:
✅ Compliant with recommendation (within 78 characters):
Subject: This is a short subject line
✅ Compliant but exceeds recommendation (79-998 characters):
Subject: This is a very long subject line that exceeds the recommended 78 character limit but is still within the required 998 character maximum limit
❌ Non-compliant (exceeds 998 characters):
Subject: [Content exceeding 998 characters...]
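The two limits can be expressed as a small classifier. This is an illustrative sketch only; the function name `classify_line` and its return strings are assumptions, and the line is expected without its terminating CRLF.

```python
MAX_LINE = 998   # MUST limit, excluding CRLF
SOFT_LINE = 78   # SHOULD limit, excluding CRLF


def classify_line(line: str) -> str:
    """Classify one line (CRLF already removed) against the two limits."""
    length = len(line)
    if length > MAX_LINE:
        return "violates MUST"
    if length > SOFT_LINE:
        return "exceeds SHOULD"
    return "ok"


print(classify_line("Subject: This is a short subject line"))
print(classify_line("Subject: " + "x" * 100))
print(classify_line("Subject: " + "x" * 1000))
```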
2.2. Header Fields
Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF. A field name MUST be composed of printable US-ASCII characters (i.e., characters that have values between 33 and 126, inclusive), except colon. A field body may be composed of printable US-ASCII characters as well as the space (SP, ASCII value 32) and horizontal tab (HTAB, ASCII value 9) characters (together known as the white space characters, WSP). A field body MUST NOT include CR and LF except when used in "folding" and "unfolding", as described in Section 2.2.3. All field bodies MUST conform to the syntax described in Sections 3 and 4 of this specification.
Header Field Format:
Field-Name: Field-BodyCRLF
- Field-Name: the field name
- ":" (colon): the delimiter between field name and field body
- Field-Body: the field content
- CRLF: the line terminator
Example:
From: [email protected]
Subject: Meeting Tomorrow
Date: Sat, 20 Dec 2025 10:00:00 +0800
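The character rules for field names and field bodies can be sketched as a small splitter. This assumes the field has already been unfolded and its CRLF removed; `parse_header_field` is a hypothetical name, and trimming the white space around the body is a convenience of this sketch rather than something the section requires.

```python
def parse_header_field(line: str) -> tuple[str, str]:
    """Split an unfolded header field (no CRLF) into (field name, field body)."""
    name, colon, body = line.partition(":")
    if not colon:
        raise ValueError("missing colon delimiter")
    # Field name: printable US-ASCII (33..126); the colon is already excluded
    # because partition() splits at the first one.
    if not name or not all(33 <= ord(ch) <= 126 for ch in name):
        raise ValueError(f"invalid field name: {name!r}")
    # Field body: printable US-ASCII plus SP (32) and HTAB (9).
    if not all(ch == "\t" or 32 <= ord(ch) <= 126 for ch in body):
        raise ValueError("invalid character in field body")
    # Assumption of this sketch: surrounding white space is trimmed.
    return name, body.strip(" \t")


print(parse_header_field("Subject: Meeting Tomorrow"))
```

Because the split happens at the first colon, colons inside the field body (for example in a time of day) are left untouched.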
2.2.1. Unstructured Header Field Bodies
Some field bodies in this specification are defined simply as "unstructured" (which is specified in Section 3.2.5 as any printable US-ASCII characters plus white space characters) with no further restrictions. These are referred to as unstructured field bodies. Semantically, unstructured field bodies are simply to be treated as a single line of characters with no further processing (except for "folding" and "unfolding" as described in Section 2.2.3).
Unstructured Field Examples:
Subject: This is any text I want to write
Comments: Here's a free-form comment
Characteristics:
- Free-form content: any printable US-ASCII characters plus white space
- No specific syntax structure required
- Only folding/unfolding needs to be processed
2.2.2. Structured Header Field Bodies
Some field bodies in this specification have a more restrictive syntax than the unstructured field bodies described above. These are referred to as "structured" field bodies. Structured field bodies are sequences of specific lexical tokens as described in Sections 3 and 4 of this specification. Many of these tokens are allowed (according to their syntax) to be introduced or end with comments (as described in Section 3.2.2) as well as white space characters, and those white space characters are subject to "folding" and "unfolding" as described in Section 2.2.3. Semantic analysis of structured field bodies is given along with their syntax.
Structured Field Examples:
From: Alice Smith <[email protected]>
To: [email protected], [email protected]
Date: Sat, 20 Dec 2025 10:00:00 +0800
Characteristics:
- Must follow strict syntax rules
- Contains specific lexical tokens (e.g., email addresses, date-time)
- Can include comments and white space
- Requires parsing according to syntax
Comparison:
| Type | Syntax Strictness | Processing | Typical Examples |
|---|---|---|---|
| Unstructured | Loose, free text | As whole string | Subject, Comments |
| Structured | Strict, specific syntax | Parse by lexical tokens | From, To, Date |
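The difference in processing can be illustrated with Python's standard `email.utils` module, which offers parsers for address and date-time field bodies. This is only a sketch of the idea: the address `alice@example.com` is a hypothetical placeholder, and the field bodies are assumed to be already unfolded.

```python
from email.utils import getaddresses, parsedate_to_datetime

subject_body = "This is any text I want to write"
from_body = "Alice Smith <alice@example.com>"   # hypothetical address
date_body = "Sat, 20 Dec 2025 10:00:00 +0800"

# Unstructured: the field body is used as a whole string.
subject = subject_body

# Structured: the field body is parsed into its lexical tokens.
mailboxes = getaddresses([from_body])      # list of (display name, address)
when = parsedate_to_datetime(date_body)    # timezone-aware datetime

print(subject)
print(mailboxes)
print(when.isoformat())
```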
2.2.3. Long Header Fields
Each header field is logically a single line of characters comprising the field name, the colon, and the field body. For convenience however, and to deal with the 998/78 character limitations per line, the field body portion of a header field can be split into a multi-line representation; this is called "folding". The general rule is that wherever this specification allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP.
Folding Example:
Original header field:
Subject: This is a test
Can be represented as (folded; the second line begins with a space):
Subject: This
 is a test
Note: Though structured field body definitions permit folding wherever folding white space is allowed (and even within lexical tokens), folding SHOULD be limited to placing the CRLF at higher-level syntactic breaks. For instance, if a field body is defined as comma-separated values, it is recommended that folding occur after the comma separating the structured items, rather than within the items themselves, even if it is allowed elsewhere.
Folding Rules:
- Insertion Point: Insert CRLF before any WSP (space or tab)
- Continuation Line: Next line must start with WSP
- Recommended Position: At high-level syntax breaks (e.g., after commas)
Multiple Address Folding Example:
Recommended folding (after commas; each continuation line begins with a space):
To: [email protected],
 [email protected],
 [email protected]
Not recommended but legal (folding inside an address):
To: [email protected], bob@
 example.com, [email protected]
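The recommended style, folding after the commas that separate items, can be sketched as follows. The function name `fold_at_commas`, the default limit of 78, and the example addresses are assumptions for illustration. The sketch breaks only at a comma followed by a space, does not understand quoted strings or comments, and leaves a single item that is longer than the limit unbroken.

```python
def fold_at_commas(field: str, limit: int = 78) -> str:
    """Fold an unfolded header field after commas to fit within `limit`."""
    parts = field.split(", ")
    lines = [parts[0]]
    for part in parts[1:]:
        candidate = lines[-1] + ", " + part
        if len(candidate) <= limit:
            # Still fits: keep the item on the current line.
            lines[-1] = candidate
        else:
            # Insert CRLF before the space that followed the comma, so the
            # continuation line starts with that WSP.
            lines[-1] += ","
            lines.append(" " + part)
    return "\r\n".join(lines)


field = "To: a@example.com, b@example.com, c@example.com"  # hypothetical
print(fold_at_commas(field, limit=30))
```

Because the CRLF is inserted directly before an existing space, removing each CRLF again restores the original field exactly.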
Unfolding is the reverse process of moving this folded multiple-line representation into its single line representation. Unfolding is accomplished by simply removing any CRLF that is immediately followed by WSP. Each header field should be treated in its unfolded form for further syntactic and semantic evaluation. An unfolded header field has no length restriction and therefore may be indeterminately long.
Unfolding Process:
Before folding (logical):
Subject: This is a very long subject line
After folding (transmission; the continuation line begins with a space):
Subject: This is a
 very long subject line
After unfolding (parsing):
Subject: This is a very long subject line
Unfolding Algorithm:
1. Identify: Find CRLF + WSP pattern
2. Remove: Delete CRLF, keep WSP
3. Result: Continuous single-line string
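The three steps above can be sketched with a single regular expression. The name `unfold` is hypothetical; the input is assumed to be header text as a string that still contains its CRLF line terminators.

```python
import re

# A CRLF that is immediately followed by WSP (space or horizontal tab).
# The lookahead leaves the WSP itself in place.
_FOLD = re.compile(r"\r\n(?=[ \t])")


def unfold(header_text: str) -> str:
    """Remove every CRLF that is immediately followed by WSP."""
    return _FOLD.sub("", header_text)


print(unfold("Subject: This is a\r\n very long subject line"))
```

A CRLF that is followed by anything other than WSP marks the end of a header field and is left untouched.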
2.3. Body
The body of a message is simply lines of US-ASCII characters. The only two limitations on the body are:
- CR and LF MUST only occur together as CRLF; they MUST NOT appear independently in the body.
- Lines of characters in the body MUST be limited to 998 characters, and SHOULD be limited to 78 characters, excluding the CRLF.
Note: As has been mentioned, there are other documents, specifically the MIME documents (RFC 2045, RFC 2046, RFC 2049, RFC 4288, RFC 4289), that extend (and limit) this specification to allow for different kinds of message bodies. Again, these mechanisms are beyond the scope of this document.
Body Restrictions Summary:
| Restriction Type | Requirement | Description |
|---|---|---|
| Line Terminator | MUST use CRLF | CR and LF cannot appear independently |
| Line Length (Hard) | MUST ≤ 998 characters | Excluding CRLF |
| Line Length (Recommended) | SHOULD ≤ 78 characters | Excluding CRLF |
Legal Body Example:
Hello Bob,CRLF
CRLF
Let's meet tomorrow at 10am.CRLF
CRLF
Best regards,CRLF
AliceCRLF
Illegal Body Examples:
❌ Contains independent CR or LF:
Hello BobCR (missing LF)
LineLF (missing CR)
❌ Line too long (exceeds 998 characters):
[Continuous text exceeding 998 characters...]
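Both body restrictions can be checked in one pass. This is a minimal sketch, assuming the body is available as bytes; `body_problems` and its message strings are hypothetical, and only the MUST-level rules are checked.

```python
import re

# A CR not followed by LF, or an LF not preceded by CR.
_BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def body_problems(body: bytes) -> list[str]:
    """Return a list of MUST-level violations found in a message body."""
    problems = []
    if _BARE_CR_OR_LF.search(body):
        problems.append("bare CR or LF")
    # Line length is measured excluding the CRLF, so split on it.
    for number, line in enumerate(body.split(b"\r\n"), start=1):
        if len(line) > 998:
            problems.append(f"line {number} exceeds 998 characters")
    return problems


print(body_problems(b"Hello Bob,\r\n\r\nLet's meet tomorrow at 10am.\r\n"))
print(body_problems(b"Hello Bob\rLine\n"))
```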
Section 2 Summary
Key Concepts
- Character Set: US-ASCII (1-127)
- Line Terminator: CRLF (CR+LF)
- Message Structure: Header Section + Empty Line + Body
- Line Length: 998 characters (MUST), 78 characters (SHOULD)
Header Field Type Comparison
- Unstructured fields (e.g., Subject: any text; Comments: ...): processed as a whole string
- Structured fields (e.g., From: user@domain; To: user1, user2; Date: Day, DD Mon YYYY): parsed according to their syntax
Folding Mechanism
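Long field, single line (logical form):
To: [email protected], [email protected]
Folded, multi-line (transmission form; the continuation line begins with a space):
To: [email protected],
 [email protected]
Folding converts the logical form into the transmission form; unfolding reverses it.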
Next Steps
Section 2 provides a lexical overview of messages. Section 3 defines the precise ABNF syntax rules for creating conformant messages.