TRE Regexp Syntax

This document describes the POSIX 1003.2 extended RE (ERE) syntax and the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) notation is used to describe the grammar.

ERE Syntax

Alternation operator

extended-regexp ::= branch
                |   extended-regexp "|" branch

An extended regexp (ERE) is one or more branches, separated by |. An ERE matches anything that matches one or more of the branches.
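As a rough illustration of alternation, the sketch below uses Python's re module as a stand-in for TRE. That is an assumption made for convenience: re is not a POSIX matcher and may choose differently among branches that could match at the same position, so whole-string matching is used to sidestep the difference. The pattern and the helper name matches_any_branch are invented for the example.

```python
import re

# Three branches separated by "|". Python's re is only a stand-in for TRE here.
ERE = re.compile(r"cat|dog|bird")

def matches_any_branch(text):
    """Return True if the whole of text matches one of the branches."""
    return ERE.fullmatch(text) is not None

for word in ["cat", "dog", "fish", "catdog"]:
    print(word, matches_any_branch(word))
```

A string matches only when some single branch accounts for all of it, so "catdog" is rejected even though it begins with one branch and ends with another.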

Catenation of REs

branch ::= piece
       |   branch piece

A branch is one or more pieces concatenated. It matches a match for the first piece, followed by a match for the second piece, and so on.

piece ::= atom
      |   atom repeat-operator

A piece is an atom possibly followed by a repeat operator.

atom ::= "(" extended-regexp ")"
     |   bracket-expression
     |   "."
     |   assertion
     |   literal
     |   back-reference

An atom is either an ERE enclosed in parentheses, a bracket expression, a . (period), an assertion, a literal, or a back reference.

The dot (.) matches any single character. If the REG_NEWLINE compilation flag (see API manual) is specified, the newline character is not matched.

Repeat operators


repeat-operator ::= "*"
                |   "+"
                |   "?"
                |   bound
                |   "*?"
                |   "+?"
                |   "??"
                |   bound "?"

An atom followed by * matches a sequence of 0 or more matches of the atom. + is similar to *, matching a sequence of 1 or more matches of the atom. An atom followed by ? matches a sequence of 0 or 1 matches of the atom.

A bound is one of the following, where m and n are unsigned decimal integers between 0 and RE_DUP_MAX:

  1. {m,n}
  2. {m,}
  3. {m}

An atom followed by [1] matches a sequence of m through n (inclusive) matches of the atom. An atom followed by [2] matches a sequence of m or more matches of the atom. An atom followed by [3] matches a sequence of exactly m matches of the atom.
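One way to check the repeat operators and the three bound forms is to see which lengths of a run of a each pattern accepts. The sketch below does this with Python's re module, which is assumed here to treat these particular operators the same way as TRE; the helper name accepted_lengths and the limit of 5 are choices made for the example.

```python
import re

def accepted_lengths(pattern, limit=5):
    """Return the run lengths 0..limit of 'a' that pattern matches in full."""
    return [n for n in range(limit + 1)
            if re.fullmatch(pattern, "a" * n) is not None]

# The three simple operators, then the three bound forms {m,n}, {m,}, {m}.
for pattern in ["a*", "a+", "a?", "a{2,3}", "a{2,}", "a{2}"]:
    print(pattern, accepted_lengths(pattern))
```

Whole-string matching is used so that each pattern has to account for every character of the run.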

Adding a ? to a repeat operator makes the subexpression minimal, or non-greedy. Normally a repeated expression is greedy, that is, it matches as many characters as possible. A non-greedy subexpression matches as few characters as possible. Note that this does not (always) mean the same thing as matching as many or few repetitions as possible.
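The difference between greedy and minimal repetition is easiest to see on a string with more than one possible end point. This sketch again uses Python's re module in place of TRE, on the assumption that both give the same result for a pattern this simple; the sample text is invented.

```python
import re

text = "<b>bold</b>"

# Greedy: .* takes as many characters as it can before the final ">".
greedy = re.search(r"<.*>", text).group(0)

# Minimal: .*? takes as few characters as it can before the first ">".
minimal = re.search(r"<.*?>", text).group(0)

print(greedy)
print(minimal)
```

The greedy form spans the whole string, while the minimal form stops at the first closing angle bracket.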

Bracket expressions

bracket-expression ::= "[" item+ "]"
                   |   "[^" item+ "]"

A bracket expression specifies a set of characters by enclosing a nonempty list of items in brackets. Normally anything matching any item in the list is matched. If the list begins with ^ the meaning is negated; any character matching no item in the list is matched.

An item is any of the following:

To include a literal - in the list, make it either the first or last item, the second endpoint of a range, or enclose it in [. and .] to make it a collating element. To include a literal ] in the list, make it either the first item, the second endpoint of a range, or enclose it in [. and .]. To use a literal - as the first endpoint of a range, enclose it in [. and .].
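The placement rules for - and ] can be tried out with a few single-character tests. The sketch uses Python's re module as a stand-in for TRE, which is an assumption: re accepts the plain, negated, first-item and last-item forms shown here, but it does not support the [. and .] collating-element syntax, so that form is left out. The helper name in_set is made up for the example.

```python
import re

def in_set(bracket_expression, ch):
    """Return True if the single character ch matches the bracket expression."""
    return re.fullmatch(bracket_expression, ch) is not None

print(in_set(r"[abc]", "b"))    # listed character
print(in_set(r"[^abc]", "b"))   # negated list excludes it
print(in_set(r"[-a]", "-"))     # literal - as the first item
print(in_set(r"[a-]", "-"))     # literal - as the last item
print(in_set(r"[]a]", "]"))     # literal ] as the first item
```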

Assertions

assertion ::= "^"
          |   "$"
          |   "\" assertion-character

The expressions ^ and $ are called "left anchor" and "right anchor", respectively. The left anchor matches the empty string at the beginning of the string. The right anchor matches the empty string at the end of the string. The behaviour of both anchors can be varied by specifying certain execution and compilation flags; see the API manual.
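A short sketch of the default anchor behaviour follows, using Python's re module in place of TRE. This is an assumption that holds only for input without newlines, because re treats a trailing newline specially; the execution and compilation flags mentioned above are not modelled. The helper name span_of is invented for the example.

```python
import re

def span_of(pattern, text):
    """Return the (start, end) offsets of the first match, or None."""
    m = re.search(pattern, text)
    return m.span() if m else None

print(span_of(r"^foo", "foobar"))   # anchored at the start
print(span_of(r"^bar", "foobar"))   # "bar" is not at the start
print(span_of(r"bar$", "foobar"))   # anchored at the end
print(span_of(r"^", "abc"))         # empty match at the beginning
print(span_of(r"$", "abc"))         # empty match at the end
```

The last two calls show that an anchor on its own matches an empty string: the reported start and end offsets are equal.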

An assertion-character can be any of the following:

Literals

literal ::= "\" character
        |   ordinary-character

A literal is either an escaped or an ordinary character. An escaped character is a \ followed by any character, and matches that character. Escaping can be used to match characters which have a special meaning in regexp syntax. A \ cannot be the last character of an ERE. Escaping also allows you to include a few non-printable characters in the regular expression. These special escape sequences include:

An ordinary character is just a single character with no other significance, and matches that character. A { followed by something other than a digit is considered an ordinary character.
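The escaping rules can be sketched with Python's re module standing in for TRE. This is an assumption that holds for escaped punctuation only, since re gives its own meanings to many backslash-letter sequences. The helper names whole_match and compiles are invented for the example.

```python
import re

def whole_match(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

def compiles(pattern):
    """Return True if the pattern is accepted by the compiler."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(whole_match(r"a\.b", "a.b"))   # escaped . matches only a period
print(whole_match(r"a\.b", "axb"))   # and not any other character
print(whole_match(r"1\+1", "1+1"))   # escaped + is a literal plus sign
print(whole_match(r"a{x", "a{x"))    # { not followed by a digit is ordinary
print(compiles("a\\"))               # a trailing backslash is rejected
```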

Back references

back-reference ::= "\" ["1"-"9"]

A back reference is a backslash followed by a single non-zero decimal digit d. It matches the same sequence of characters matched by the dth parenthesized subexpression.

Back references are not defined for POSIX EREs (for BREs they are), but many matchers, including TRE, implement back references for both EREs and BREs.
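A minimal sketch of a back reference, using Python's re module as a stand-in for TRE (an assumption; the pattern and the helper name is_doubled are invented for the example). The back reference \1 must match the same characters that the first parenthesized subexpression matched, not merely anything that subexpression could match.

```python
import re

# (a|b) matches one letter; \1 then requires that same letter again.
DOUBLED = re.compile(r"(a|b)\1")

def is_doubled(text):
    """Return True if text is 'aa' or 'bb'."""
    return DOUBLED.fullmatch(text) is not None

for candidate in ["aa", "bb", "ab", "ba"]:
    print(candidate, is_doubled(candidate))
```

"ab" is rejected even though b would satisfy (a|b) by itself, because the subexpression matched a and the back reference repeats that exact text.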

BRE Syntax

The obsolete basic regexp (BRE) syntax differs from the ERE syntax as follows: