2. Lexical Analysis of Messages

2.1. General Description

At the most basic level, a message is a series of characters. A message that is conformant with this specification is composed of characters with values in the range 1 through 127, interpreted as US-ASCII characters. For convenience, this document sometimes refers to this range of characters as simply "US-ASCII characters".

Note: This specification specifies that messages are made up of characters in the US-ASCII range of 1 through 127. There are other documents, specifically the MIME document series (RFC 2045, RFC 2046, RFC 2047, RFC 2049, RFC 4288, RFC 4289), that extend this specification to allow for values outside that range. Discussion of those mechanisms is not within the scope of this specification.
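The character range above can be checked mechanically. The following is a minimal sketch in Python, assuming the message is held as raw bytes; the function name `is_us_ascii_message` is hypothetical and not part of this specification.

```python
def is_us_ascii_message(data: bytes) -> bool:
    """Return True if every octet has a value in the range 1 through 127."""
    # NUL (0) and any octet with the high bit set (128-255) fall outside
    # the range this specification allows.
    return all(1 <= octet <= 127 for octet in data)
```

A message containing a NUL octet or raw 8-bit data fails this check; MIME mechanisms that extend the range are out of scope here.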

Messages are divided into lines of characters. A line is a series of characters that is delimited with the two characters carriage-return and line-feed; that is, the carriage return (CR) character (ASCII value 13) followed immediately by the line feed (LF) character (ASCII value 10). (The carriage-return/line-feed pair is usually written in this document as "CRLF".)
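As an illustration of the line definition, the sketch below splits raw bytes on CRLF only, so a lone CR or LF never ends a line. The helper name `split_lines` is an assumption made for this example.

```python
from typing import List


def split_lines(data: bytes) -> List[bytes]:
    """Split data into lines delimited by CRLF (CR, value 13, then LF, value 10)."""
    lines = data.split(b"\r\n")
    # A trailing CRLF terminates the last line; it does not start a new one.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines
```

Because only the two-character sequence is treated as a delimiter, input such as `b"a\nb\r\n"` yields a single line that still contains the bare LF.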

A message consists of header fields (collectively called "the header section of the message") optionally followed by a body. The header section is a sequence of lines of characters with special syntax as defined in this specification. The body is simply a sequence of characters that follows the header section and is separated from the header section by an empty line (i.e., a line with nothing preceding the CRLF).

Note: Common parlance and earlier versions of this specification use the term "header" to refer to either the entire header section or to an individual header field. To avoid ambiguity, this document does not use the terms "header" or "headers" in isolation, but instead always uses "header field" to refer to an individual field and "header section" to refer to the entire collection.

Message Structure Diagram

Message
├── Header Section
│   ├── From: [email protected] CRLF
│   ├── To: [email protected] CRLF
│   ├── Subject: Hello CRLF
│   └── Date: ... CRLF
├── Empty Line
│   └── CRLF
└── Body
    ├── This is the message body. CRLF
    └── Second line. CRLF
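The structure in the diagram can be sketched in Python as a split at the first empty line. This is an illustrative sketch only: `split_message` is a hypothetical name, and it assumes the input has a non-empty header section and uses CRLF line endings throughout.

```python
from typing import Optional, Tuple


def split_message(raw: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (header_section, body); body is None if no empty line is present."""
    # The empty line shows up as two consecutive CRLFs: the first ends the
    # last header field, the second is the empty separator line itself.
    head, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        return raw, None
    return head + b"\r\n", body
```

The CRLF that terminates the last header field is kept with the header section, so every header field line in the result is still CRLF-terminated.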

2.1.1. Line Length Limits

There are two limits that this specification places on the number of characters in a line. Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF.

The 998 character limit exists due to limitations in many implementations that send, receive, or store IMF messages and simply cannot handle more than 998 characters on a line. Receiving implementations would do well to handle an arbitrarily large number of characters in a line for robustness' sake. However, so many implementations (in compliance with the transport requirements of RFC 5321) do not accept messages with more than 1000 characters per line, including the CR and LF, that it is important for implementations not to create such messages.

The more conservative 78 character recommendation is to accommodate the many implementations of user interfaces that display these messages which may truncate, or disastrously wrap, the display of more than 78 characters per line, even though such implementations are non-conformant to the intent of this specification (and that of RFC 5321 if they actually cause information to be lost). Again, even though this limitation is put on messages, it is incumbent upon implementations that display messages to handle an arbitrarily large number of characters in a line (certainly at least up to the 998 character limit) for the sake of robustness.

Line Length Limits Summary:

Limit Type         Length (excluding CRLF)  Requirement  Reason
Hard Limit         998 characters           MUST         Many implementations cannot handle longer lines
Recommended Limit  78 characters            SHOULD       Accommodate display truncation in UIs

Line Length Examples:

✅ Compliant with recommendation (within 78 characters):
Subject: This is a short subject line

✅ Compliant but exceeds recommendation (79-998 characters):
Subject: This is a very long subject line that exceeds the recommended 78 character limit but is still within the required 998 character maximum limit

❌ Non-compliant (exceeds 998 characters):
Subject: [Content exceeding 998 characters...]
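The three cases above can be expressed as a small classifier. This is a sketch under the assumption that each line is passed in without its terminating CRLF; the function name and the returned labels are illustrative.

```python
MAX_LINE_LENGTH = 998         # MUST NOT be exceeded
RECOMMENDED_LINE_LENGTH = 78  # SHOULD NOT be exceeded


def classify_line(line: str) -> str:
    """Classify one line, given without its CRLF, against both limits."""
    length = len(line)
    if length > MAX_LINE_LENGTH:
        return "non-compliant"
    if length > RECOMMENDED_LINE_LENGTH:
        return "exceeds recommendation"
    return "within recommendation"
```

Both limits are inclusive: a line of exactly 78 characters is within the recommendation, and a line of exactly 998 characters is still compliant.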

2.2. Header Fields

Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF. A field name MUST be composed of printable US-ASCII characters (i.e., characters that have values between 33 and 126, inclusive), except colon. A field body may be composed of printable US-ASCII characters as well as the space (SP, ASCII value 32) and horizontal tab (HTAB, ASCII value 9) characters (together known as the white space characters, WSP). A field body MUST NOT include CR and LF except when used in "folding" and "unfolding", as described in Section 2.2.3. All field bodies MUST conform to the syntax described in Sections 3 and 4 of this specification.

Header Field Format:

Field-Name: Field-BodyCRLF
    ↑     ↑     ↑      ↑
    |     |     |      +--- Line terminator
    |     |     +---------- Field content
    |     +---------------- Colon delimiter
    +---------------------- Field name

Example:

From: [email protected]
Subject: Meeting Tomorrow
Date: Sat, 20 Dec 2025 10:00:00 +0800
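The character rules for field names and field bodies can be sketched as follows. The sketch assumes the header field has already been unfolded and stripped of its CRLF; `parse_header_field` is a hypothetical helper, and it does not check the field-specific syntax from Sections 3 and 4.

```python
from typing import Tuple


def parse_header_field(line: str) -> Tuple[str, str]:
    """Split an unfolded header field (no CRLF) into (field name, field body)."""
    name, colon, body = line.partition(":")
    if not colon or not name:
        raise ValueError("header field needs a field name followed by a colon")
    # Field name: printable US-ASCII (33-126). The colon is excluded
    # automatically because the split happens at the first colon.
    if not all(33 <= ord(ch) <= 126 for ch in name):
        raise ValueError("field name contains a non-printable character or white space")
    # Field body: printable US-ASCII plus SP (32) and HTAB (9).
    if not all(ch in " \t" or 33 <= ord(ch) <= 126 for ch in body):
        raise ValueError("field body contains a disallowed character")
    return name, body
```

The body is returned exactly as it appears after the colon, so a leading space (as in `"Subject: Meeting Tomorrow"`) is preserved.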

2.2.1. Unstructured Header Field Bodies

Some field bodies in this specification are defined simply as "unstructured" (which is specified in Section 3.2.5 as any printable US-ASCII characters plus white space characters) with no further restrictions. These are referred to as unstructured field bodies. Semantically, unstructured field bodies are simply to be treated as a single line of characters with no further processing (except for "folding" and "unfolding" as described in Section 2.2.3).

Unstructured Field Examples:

Subject: This is any text I want to write
Comments: Here's a free-form comment

Characteristics:

Free-form content: any printable US-ASCII characters plus white space
No specific syntax structure required
Only folding/unfolding needs to be processed

2.2.2. Structured Header Field Bodies

Some field bodies in this specification have a more restrictive syntax than the unstructured field bodies described above. These are referred to as "structured" field bodies. Structured field bodies are sequences of specific lexical tokens as described in Sections 3 and 4 of this specification. Many of these tokens are allowed (according to their syntax) to be introduced or end with comments (as described in Section 3.2.2) as well as white space characters, and those white space characters are subject to "folding" and "unfolding" as described in Section 2.2.3. Semantic analysis of structured field bodies is given along with their syntax.

Structured Field Examples:

From: Alice Smith <[email protected]>
To: [email protected], [email protected]
Date: Sat, 20 Dec 2025 10:00:00 +0800

Characteristics:

Must follow strict syntax rules
Contains specific lexical tokens (e.g., email addresses, date-time)
Can include comments and white space
Requires parsing according to syntax

Comparison:

Type          Syntax Strictness        Processing               Typical Examples
Unstructured  Loose, free text         As whole string          Subject, Comments
Structured    Strict, specific syntax  Parse by lexical tokens  From, To, Date
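To illustrate the difference in processing, the sketch below parses two structured field bodies with Python's standard `email.utils` module, whereas an unstructured body such as Subject would simply be kept as a string. The wrapper name `parse_structured_bodies` and the example.com addresses in the usage note are illustrative assumptions, not values from this document.

```python
from datetime import datetime
from email.utils import getaddresses, parsedate_to_datetime
from typing import List, Tuple


def parse_structured_bodies(
    date_body: str, to_body: str
) -> Tuple[datetime, List[Tuple[str, str]]]:
    """Parse a Date body and an address-list body into typed values."""
    # Date: becomes a timezone-aware datetime.
    when = parsedate_to_datetime(date_body.strip())
    # To: becomes a list of (display name, address) pairs.
    recipients = getaddresses([to_body.strip()])
    return when, recipients
```

For example, calling it with `"Sat, 20 Dec 2025 10:00:00 +0800"` and `"Alice Smith <alice@example.com>, bob@example.com"` yields a datetime with a UTC offset of eight hours and two name/address pairs, the second with an empty display name.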

2.2.3. Long Header Fields

Each header field is logically a single line of characters comprising the field name, the colon, and the field body. For convenience however, and to deal with the 998/78 character limitations per line, the field body portion of a header field can be split into a multi-line representation; this is called "folding". The general rule is that wherever this specification allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP.

Folding Example:

Original header field:

Subject: This is a test

Can be represented as (folded):

Subject: This
 is a test

Note: Though structured field body definitions permit folding wherever folding white space is allowed (and even within lexical tokens), folding SHOULD be limited to placing the CRLF at higher-level syntactic breaks. For instance, if a field body is defined as comma-separated values, it is recommended that folding occur after the comma separating the structured items, rather than within the items themselves, even if it is allowed elsewhere.

Folding Rules:

Insertion Point: Insert CRLF before any WSP (space or tab)
Continuation Line: Next line must start with WSP
Recommended Position: At high-level syntax breaks (e.g., after commas)

Multiple Address Folding Example:

Recommended folding (after commas):
To: [email protected],
    [email protected],
    [email protected]

Not recommended but legal:
To: [email protected], bob@
    example.com, [email protected]
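A simple greedy folder can be sketched as below. It is an illustration, not a complete implementation: it treats every space or tab as a legal fold point, which holds for unstructured field bodies but is only an approximation for structured ones, and it does not try to prefer higher-level syntactic breaks. The name `fold` and the default limit of 78 are choices made for this example.

```python
def fold(line: str, limit: int = 78) -> str:
    """Fold an unfolded header field by inserting CRLF before WSP characters."""
    pieces = []
    while len(line) > limit:
        # Last space or tab that keeps this piece within the limit. The
        # search starts at index 1 so the WSP that opens a continuation
        # line is never chosen (that would produce an empty piece).
        cut = max(line.rfind(" ", 1, limit + 1), line.rfind("\t", 1, limit + 1))
        if cut == -1:
            break  # no WSP available; the remainder stays on one long line
        pieces.append(line[:cut])
        line = line[cut:]  # the continuation line starts with the WSP
    pieces.append(line)
    return "\r\n".join(pieces)
```

With `limit=13`, `fold("Subject: This is a test")` produces `"Subject: This\r\n is a test"`, matching the folded form shown earlier in this section.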

Unfolding is the reverse process of moving this folded multiple-line representation into its single line representation. Unfolding is accomplished by simply removing any CRLF that is immediately followed by WSP. Each header field should be treated in its unfolded form for further syntactic and semantic evaluation. An unfolded header field has no length restriction and therefore may be indeterminately long.

Unfolding Process:

Before folding (logical):
Subject: This is a very long subject line

After folding (transmission):
Subject: This is a
 very long subject line

After unfolding (parsing):
Subject: This is a very long subject line

Unfolding Algorithm:

Identify: Find CRLF + WSP pattern
Remove: Delete CRLF, keep WSP
Result: Continuous single-line string
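The unfolding steps above can be sketched with a single regular expression in Python. The function name `unfold` is an assumption; the pattern removes each CRLF that is immediately followed by a space or tab and leaves that white space in place.

```python
import re

# CRLF followed by WSP; the lookahead keeps the WSP out of the match.
_FOLD_POINT = re.compile(r"\r\n(?=[ \t])")


def unfold(text: str) -> str:
    """Remove every CRLF that is immediately followed by a space or tab."""
    return _FOLD_POINT.sub("", text)
```

A CRLF followed by anything other than WSP is left alone, so the boundaries between separate header fields survive unfolding.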

2.3. Body

The body of a message is simply lines of US-ASCII characters. The only two limitations on the body are:

CR and LF MUST only occur together as CRLF; they MUST NOT appear independently in the body.
Lines of characters in the body MUST be limited to 998 characters, and SHOULD be limited to 78 characters, excluding the CRLF.

Note: As has been mentioned, there are other documents, specifically the MIME documents (RFC 2045, RFC 2046, RFC 2049, RFC 4288, RFC 4289), that extend (and limit) this specification to allow for different kinds of message bodies. Again, these mechanisms are beyond the scope of this document.

Body Restrictions Summary:

Restriction Type           Requirement             Description
Line Terminator            MUST use CRLF           CR and LF cannot appear independently
Line Length (Hard)         MUST ≤ 998 characters   Excluding CRLF
Line Length (Recommended)  SHOULD ≤ 78 characters  Excluding CRLF

Legal Body Example:

Hello Bob,CRLF
CRLF
Let's meet tomorrow at 10am.CRLF
CRLF
Best regards,CRLF
AliceCRLF

Illegal Body Examples:

❌ Contains independent CR or LF:
Hello BobCR (missing LF)
LineLF (missing CR)

❌ Line too long (exceeds 998 characters):
[Continuous text exceeding 998 characters...]
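Both MUST-level body restrictions can be checked with a short routine. The sketch below assumes the body is available as raw bytes; `body_problems` and its message strings are hypothetical, and the 78 character recommendation is deliberately not reported.

```python
import re
from typing import List

# A CR not followed by LF, or an LF not preceded by CR.
_BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def body_problems(body: bytes) -> List[str]:
    """Return a list of MUST-level violations found in a message body."""
    problems = []
    if _BARE_CR_OR_LF.search(body):
        problems.append("bare CR or LF")
    for number, line in enumerate(body.split(b"\r\n"), start=1):
        if len(line) > 998:
            problems.append(f"line {number} exceeds 998 characters")
    return problems
```

An empty list means neither restriction was violated.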

Chapter 2 Summary

Key Concepts

Character Set: US-ASCII (1-127)
Line Terminator: CRLF (CR+LF)
Message Structure: Header Section + Empty Line + Body
Line Length: 998 characters (MUST), 78 characters (SHOULD)

Header Field Type Comparison

Unstructured Fields         Structured Fields
        ↓                           ↓
Subject: Any text          From: user@domain
Comments: ...              To: user1, user2
                           Date: Day, DD Mon YYYY
        ↓                           ↓
   Process as              Parse by syntax
      whole

Folding Mechanism

Single line (logical)                                     Multi-line (transmission)

To: [email protected], [email protected]   --- Folding --->    To: [email protected],
                                         <-- Unfolding ---       [email protected]

Next Steps

Section 2 provides a lexical overview of messages. Section 3 defines the precise ABNF syntax rules for creating conformant messages.
