6. Motivation and Background

While regular expressions originally were intended to describe a formal language to support a Boolean matching function, they have been enhanced with parsing functions that support the extraction and replacement of arbitrary portions of the matched text. With this accretion of features, parsing-regexp libraries have become more susceptible to bugs and surprising performance degradations that can be exploited in denial-of-service attacks by an attacker who controls the regexp submitted for processing. I-Regexp is designed to offer interoperability and to be less vulnerable to such attacks, with the trade-off that its only function is to offer a Boolean response as to whether a character sequence is matched by a regexp.

6.1. Implementing I-Regexp

XSD regexps are relatively easy to implement or map to widely implemented parsing-regexp dialects, with these notable exceptions:

  • Character class subtraction. This is a very useful feature in many specifications, but it is unfortunately mostly absent from parsing-regexp dialects. Thus, it is omitted from I-Regexp.

  • Multi-character escapes. \d, \w, \s and their uppercase complement classes exhibit a large amount of variation between regexp flavors. Thus, they are omitted from I-Regexp.

  • Unicode property escapes. Not all regexp implementations support access to Unicode tables that enable executing constructs such as \p{Nd}, although the \p/\P feature in general is now quite widely available. While, in principle, it is possible to translate these into character-class matches, this also requires access to those tables. Thus, regexp libraries in severely constrained environments may not be able to support I-Regexp conformance.
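
The three exceptions above can be illustrated with a minimal Python sketch that uses only the standard `re`, `sys`, and `unicodedata` modules. It is not part of I-Regexp and not a conforming implementation. The names `CONSONANTS`, `supports_property_escape`, and `class_for_category` are hypothetical helpers introduced here for illustration. The sketch assumes a Python 3 interpreter whose `unicodedata` tables are available, which is exactly the kind of table access the last item says a constrained environment may lack.

```python
import re
import sys
import unicodedata

# Character class subtraction: the XSD form [a-z-[aeiou]] has no direct
# I-Regexp equivalent, so the remaining ranges are spelled out by hand.
CONSONANTS = "[b-df-hj-np-tv-z]"


def supports_property_escape() -> bool:
    """Report whether this interpreter's `re` module compiles \\p{Nd}.

    Assumption: an unsupported escape raises re.error at compile time.
    """
    try:
        re.compile(r"\p{Nd}")
    except re.error:
        return False
    return True


def class_for_category(category: str) -> str:
    """Translate one Unicode general category (for example "Nd") into an
    explicit character class, using the interpreter's Unicode tables."""
    ranges = []          # list of (first, last) code point pairs
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) != category:
            continue
        if prev is not None and cp == prev + 1:
            prev = cp    # extend the current run
            continue
        if start is not None:
            ranges.append((start, prev))
        start = prev = cp
    if start is not None:
        ranges.append((start, prev))

    def esc(cp: int) -> str:
        # \UXXXXXXXX escapes avoid any clash with regexp metacharacters.
        return f"\\U{cp:08x}"

    parts = [esc(a) if a == b else f"{esc(a)}-{esc(b)}" for a, b in ranges]
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, category Nd
    print(bool(re.fullmatch(r"\d", arabic_three)))            # flavor-dependent
    print(bool(re.fullmatch(r"\d", arabic_three, re.ASCII)))  # ASCII-only mode
    print(bool(re.fullmatch(r"[0-9]", arabic_three)))         # explicit range
    print(supports_property_escape())
    print(bool(re.fullmatch(class_for_category("Nd"), arabic_three)))
```

In Python, `\d` on a `str` pattern matches any Unicode decimal digit unless `re.ASCII` is set, whereas the explicit range `[0-9]` means the same thing in every mode. That difference is one instance of the variation that led to multi-character escapes being left out. The `class_for_category` helper shows why translating `\p{Nd}` into a character class does not remove the need for Unicode tables: the tables are still consulted, only earlier.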