Skip to main content

1. Introduction

1. Introduction

This specification describes an interoperable regular expression (abbreviated as "regexp") flavor, I-Regexp.

I-Regexp does not provide advanced regular expression features such as capture groups, lookahead, or backreferences. It supports only a Boolean matching capability, i.e., testing whether a given regular expression matches a given piece of text.

I-Regexp supports the entire repertoire of Unicode characters (Unicode scalar values); both the I-Regexp strings themselves and the strings they are matched against are sequences of Unicode scalar values (often represented in UTF-8 encoding form [STD63] for interchange).

I-Regexp is a subset of XML Schema Definition (XSD) regular expressions [XSD-2].

This document includes guidance for converting I-Regexps for use with several well-known regular expression idioms.

The development of I-Regexp was motivated by the work of the JSONPath Working Group (WG). The WG wanted to include support for the use of regular expressions in JSONPath filters in its specification [JSONPATH-BASE], but was unable to find a useful specification for regular expressions that would be interoperable across the popular libraries.

1.1 Terminology

This document uses the abbreviation "regexp" for what is usually called a "regular expression" in programming. The term "I-Regexp" is used as a noun meaning a character string (sequence of Unicode scalar values) that conforms to the requirements in this specification; the plural is "I-Regexps".

This specification uses Unicode terminology; a good entry point is provided by [UNICODE-GLOSSARY].

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

The grammatical rules in this document are to be interpreted as ABNF, as described in [RFC5234] and [RFC7405], where the "characters" of Section 2.3 of [RFC5234] are Unicode scalar values.