Skip to main content

3. I-Regexp Syntax

3. I-Regexp Syntax

An I-Regexp MUST conform to the ABNF specification in Figure 1.

i-regexp = branch *( "|" branch )
branch = *piece
piece = atom [ quantifier ]
quantifier = ( "*" / "+" / "?" ) / range-quantifier
range-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}"
QuantExact = 1*%x30-39 ; '0'-'9'

atom = NormalChar / charClass / ( "(" i-regexp ")" )
NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>'
/ %x40-5A ; '@'-'Z'
/ %x5E-7A ; '^'-'z'
/ %x7E-D7FF ; skip surrogate code points
/ %xE000-10FFFF )
charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
SingleCharEsc = "\" ( %x28-2B ; '('-'+'
/ "-" / "." / "?" / %x5B-5E ; '['-'^'
/ %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
)
charClassEsc = catEsc / complEsc
charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'
/ %x5E-D7FF ; skip surrogate code points
/ %xE000-10FFFF ) / SingleCharEsc
catEsc = %s"\p{" charProp "}"
complEsc = %s"\P{" charProp "}"
charProp = IsCategory
IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
Symbols / Others
Letters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ]
Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
/ %s"i" / %s"o" / %s"s" ) ]
Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ]

Figure 1: I-Regexp Syntax in ABNF

As an additional restriction, charClassExpr is not allowed to match [^], which, according to this grammar, would parse as a positive character class containing the single character ^.

This is essentially an XSD regexp without:

  • character class subtraction,
  • multi-character escapes such as \s, \S, and \w, and
  • Unicode blocks.

An I-Regexp implementation MUST be a complete implementation of this limited subset. In particular, full support for the Unicode functionality defined in this specification is REQUIRED. The implementation:

  • MUST NOT limit itself to 7- or 8-bit character sets such as ASCII, and
  • MUST support the Unicode character property set in character classes.

3.1 Checking Implementations

A checking I-Regexp implementation is one that checks a supplied regexp for compliance with this specification and reports any problems. Checking implementations give their users confidence that they didn't accidentally insert syntax that is not interoperable, so checking is RECOMMENDED. Exceptions to this rule may be made for low-effort implementations that map I-Regexp to another regexp library by simple steps such as performing the mapping operations discussed in Section 5. Here, the effort needed to do full checking might dwarf the rest of the implementation effort. Implementations SHOULD document whether or not they are checking.

Specifications that employ I-Regexp may want to define in which cases their implementations can work with a non-checking I-Regexp implementation and when full checking is needed, possibly in the process of defining their own implementation classes.