Category talk:Wren-pattern

From Rosetta Code

Wren patterns

Wren doesn't have direct access to a regular expression library (but see Wren-regex section below) and writing such a library from scratch in a small scripting language such as Wren may not be a viable proposition due to its likely size and performance limitations.

I have therefore designed and coded a simple pattern matcher instead, the rules of which are described below. Obviously, this has nothing like the full power of regular expressions but I hope it will nevertheless be a useful addition to the armory.

Pattern

A pattern is a Wren string (i.e. a sequence of Unicode code-points) which in addition to normal characters can contain one or more of the following: character classes, extended classes, complements, either-cases, multiples, minima, ranges, optionals, captures and back-references.

Normal characters, character classes, extended classes, complements and either-cases are referred to collectively as singles since they always match a single character if they can.

Multiples, minima, ranges and optionals are referred to as quantifiers as they specify how many times a single needs to be repeated for a match to occur.

Note that quantifiers are always greedy; they match as many times as they can, never take into account whether the next symbol in the pattern would match and never backtrack to try and change an unsuccessful match into a successful one. Clearly, this needs to be borne in mind when using them in a pattern. However, there is a way to make them act lazily which will be discussed later.

Captures match a group of 0 or more singles and may include alternatives which are considered in turn until a match occurs. Finally, back-references refer to text matched by a previous capture.

Twelve meta-characters are needed and have been chosen so as not to need escaping in Wren itself (which rules out \ and %) nor be commonly used punctuation characters in English text.

/ character class
& extended class
^ complement
@ either-case
= multiple
+ minimum
# range
~ optional
[ capture start
] capture end
| separates alternatives within a capture
$ back-reference

When used literally, all meta-characters must be escaped by preceding them with / (or &) unless otherwise noted below.

There are 4 kinds of patterns depending on whether they match: anywhere within a string, from the start of the string, up to the end of the string or the whole string.

Character classes

These are single letters preceded by the / meta-character. The following always match exactly one character:

class   name       contents
 a   alphabetic   a-z A-z
 b   binary       0-1
 c   control      \x00-\x1f and \x7f (codes 0-31 and 127)
 d   decimal      0-9
 e   exponent     0-9 and .+-eE
 f   float        0-9 and .
 g   graphic      all printable ASCII except space (codes 33-126)
 h   hex          0-9 a-f A-F
 i   i class      0-2 (but can be redefined) 
 j   j class      0-3 (but can be redefined)
 k   k class      0-4 (but can be redefined)
 l   lower case   a-z
 m   m class      a-z 0-9
 n   n class      +- (sign)
 o   octal        0-7
 p   punctuation  as graphic but excluding alphabetic and decimal
 q   quote        ", ', ` (double, single or back quotes)
 r   r class      all ASCII (codes 0-127)
 s   space        \t, \n, \f, \v, \r and space
 t   t class      punctuation plus space
 u   upper case   A-Z
 v   v class      A-Z 0-9
 w   word         a-z A-Z 0-9
 x   x class      a-z A-Z 0-9 and _ (word plus underscore)
 y   y class      a-z A-Z 0-9 and _'- (x class plus apostrophe and hyphen)
 z   z class      any character at all including non-ASCII

Although the standard character classes should suffice for most purposes, the user can redefine up to three of them (i, j and k) for each pattern to deal with special cases, such as a limited range of letters or digits.

An upper case character class represents the complement of the lower case version. For example /A matches any character other than a-z or A-Z, including non-ASCII characters. Note that /Z normally just matches Z itself as it's not possible, of course, to match no characters. However, it does have a special meaning for 'lazy' and 'quantified group' matches which will be covered later.

/ followed by any character other than a letter just matches the character itself. This allows the 12 meta-characters to be treated as literal characters without their special meaning. So /| is a literal vertical bar.

Extended classes

These are single letters preceded by the & meta-character. They extend character classes to include controls, punctuation and upper/lower case letters with code-points in the range 128-255. This is useful when matching text written in Western European languages other than English. The following always match exactly one character.

class   name       contents
 a   alphabetic   a-z A-z and codes: 181, 192-214, 216-222, 223-246 and 248-255. 
 c   control      \x00-\x1f, \x7f and \x80-\x9f (codes 0-31, 127 and 128-159)
 g   graphic      all printable except space and non-breaking space (codes 33-126 and 161-255)  
 l   lower case   a-z and codes: 181, 223-246, 248-255
 m   m class      as class l plus 0-9
 n   n class      +-± (sign plus code 177)
 p   punctuation  as graphic but excluding alphabetic and decimal
 q   quote        ", ', `, «, » (double, single, back and double-angle quotes)
 r   r class      all extended (codes 0-255)
 s   space        \t, \n, \f, \v, \r, space and non-breaking space (code 160)
 t   t class      punctuation plus space
 u   upper case   A-Z and codes: 192-214, 216-222
 v   v class      as class u plus 0-9
 w   word         as class a plus 0-9
 x   x class      as class w plus _ (word plus underscore)
 y   y class      as class x plus '- and code 173 (x class plus apostrophe, hyphen and soft hyphen)

The upper case version of an extended class represents the complement of the lower case version. For example &N matches any character other than a sign.

& followed by any other letter or character behaves exactly the same as if it were preceded by / except that &Z has a special meaning for 'lazy' and 'quantified group' matches which will be covered later.

Complements

A complement is any literal character (but not other kind of single) preceded by ^. It matches any character other than the literal character itself. For example, ^a matches any character other than a.

When followed by any meta-character ^ treats them as literals. So ^/ matches any literal character except / and ^^ matches anything other than ^.

Either-cases

An either-case is any literal character (but not other kind of single) preceded by @.

If it's a lower case letter, it matches either the character itself or its upper case equivalent provided that both have code-points less than 256. For example @a matches either a or A.

If it's an upper case letter, it matches any character other than the character itself and its lower case equivalent provided that both have code-points less than 256. For example @A does not match A and a but would match b.

Non-letters simply match themselves.

Multiples

A multiple specifies the exact number of times the following single should be repeated for a match to occur. It is represented by = followed by a digit then by the single to be repeated. If the digit is between 2 and 9 then this is the number of repeats. If the digit is 0 or 1 then the number of repeats is 10 or 11 respectively. For example =3x matches 3 consecutive x's and =2/d matches 2 consecutive digits. (With regard to the last example, it would be just as easy and perhaps clearer to write /d/d).

Minima

A minimum specifies the minimum number of times the following single should be repeated for a match to occur. It is represented by + followed by a digit then by the single to be repeated. The digit can be between 0 and 9 and is the minimum number of repetitions. So +0/w matches 0 or more word characters and +1a matches one or more a's.

Ranges

A range specifies the range of times the following single should be repeated for a match to occur. It is represented by # followed by two digits and then the single to be repeated.

The first digit must be between 0 and 8.

The second digit must be between 1 and 9 and must be more than the first digit.

For example #13b matches a sequence of between 1 and 3 b's and #04C matches between 0 and 4 C's.

Optionals

An optional matches the following single zero or one times. It is represented by ~ followed by the single to be matched. For example ~/d matches zero or one digits.

It is shorthand for the range #01 but is so common that it was felt it should have its own meta-character.

Captures

These are sequences of 1 or more mini-patterns to be matched in order. At least one of the mini-patterns needs to be matched successfully for the capture as a whole to match. A mini-pattern can contain any of the elements which a normal pattern can contain except another capture and may be empty. Captures cannot overlap each other.

Captures begin with [ and end with ]. Individual mini-patterns are separated by |. For example [cat|dog] matches either cat or dog and [/d/d|/a/a|/p/p] matches two digits, letters or punctuation symbols.

Captures cannot be qualified by a quantifier and are stored separately with each match. There can be a maximum of 9 captures within the overall pattern.

Back-references

These refer back to previous captures numbered from 1 to 9 preceded by a $. So $1 refers to the first capture's text.

$0 refers to the whole of the text matched so far (before the current capture started if the back-reference is within a capture mini-pattern). It is an error to refer back to captures which have not yet taken place or been completed.

Back-references cannot be qualified by a quantifier but can appear within a mini-pattern of a subsequent capture.

Replacement patterns

These are patterns which are used in the appropriate methods as replacements for a matching pattern. They are treated as normal text except that they can contain back-references ($0 always refers to the whole match) and a literal $ must be escaped with $$ (not the usual /$).

Examples

"/d/d/d/d-/d/d-/d/d" matches a date in yyyy-mm-dd format. It could also be written: "=4/d-/d/d-/d/d".

"/u/u=9/v/d" matches an ISIN though doesn't of course validate the check-digit.

If we define:

i = "BCDFGHJKLMNPQRSTVWXYZ" (all letters excluding vowels)

j = i + "0123456789" (as i plus digits)

"/i=5/j/d" should then match a new SEDOL though again without validating the check digit.

"The quick brown [fox|c/l/l]" matches: The quick brown fox|cat|cow.

Lazy matching

The pattern "<a>[+0/z]<//a>" fails to match the string "<a>b < c</a>" because +0/z is 'greedy' and matches all characters to the end of the string including the final "</a>".

If you change the pattern to "<a>[+0^<]<//a>" it still fails to match the string because of the presence of '<' in the text between the tags.

However, if you change the pattern to "<a>[+0/Z]<//a>" and use the 'findLazy' method with a 't' parameter of "</a>", then a match will occur and the captured text will be 'b < c'. What we did here was to replace '/z' or '^<' in the pattern with '/Z' which, in a lazy method, means match any characters other than the parameter string 't'. In the non-lazy 'find' method '/Z' would just match a literal 'Z' as noted earlier.

The 'findLazy2' method works in a similar fashion but can carry out 2 lazy matches by using '/Z' for the first and '&Z' for the second which match any characters other than the parameter strings 't' and 'u' respectively.

These methods do not actually change the 'greedy' nature of the engine but use a hack (replacing text with rarely used control characters and back again after matching) to simulate lazy matching to a limited extent.

Quantification of groups of characters

Quantifiers always qualify singles. Any departure from this is an error and groups of characters cannot therefore be directly quantified.

There may sometimes be ways to quantify them indirectly either by simply repeating the pattern or by using captures and back-references. For example "[abab|ab|]" would match 'ab' repeated 2, 1 or 0 times and "[abcd]$1$1" would match 'abcd' repeated exactly three times. However, this sort of approach clearly has its limitations.

A different and usually better approach is to use the 'findWithGroup' or 'findWithGroup2' methods. These work in an analogous way to the 'findLazy' and 'findLazy2' methods. However, this time '/Z' and '&Z' respectively match the parameter strings 't' and 'u' themselves rather than any characters other than these strings and we can therefore quantify them as though they were 'singles'.

Wren-regex

Since this module was written, I've created a Wren-regex module which wraps Go's 'regexp' package. Whilst this addresses some shortcomings in Wren-pattern, it requires a special Go executable to use it and doesn't therefore work with Wren-cli.

Source code

/* Module "pattern.wren" */

/* Match represents a single successful match made by methods in the Pattern class.
   Match objects are immutable.
*/
class Match {
    // Constructs a Match object from the text of the match, its starting index as a codepoint offset
    // from the start of the string and its capture list. This is a private constructor
    // intended to be called from the Pattern class as there should be no need for the user
    // to construct Match objects directly.
    construct new_(text, index, captures) {
        if (!(text is String)) Fiber.abort("Match text must be a string.")
        if (!((index is Num) && index.isInteger && index >= 0)) {
            Fiber.abort("Match index must be a non-negative integer.")
        }
        if (!(captures is List)) Fiber.abort("Match captures must be a list of Capture objects.")
        _text = text
        _index = index
        _captures = captures
    }

    // Properties.
    text     { _text }                                // the text of the match
    index    { _index }                               // its starting index (codepoints)
    length   { _text.count }                          // its length
    span     { [_index, index + length - 1] }         // a list of its starting and ending indices
    captures { _captures.toList }                     // the Capture objects associated with the match
    capsText { _captures.map { |c| c.text }.toList }  // a list of each capture's text property

    // String representation (excluding captures)
    toString { "{ text = %(_text), index = %(_index), length = %(length) }" }
}

/* Capture represents a single successful capture made by methods in the Pattern class.
   Capture objects are immutable.
*/
class Capture {
    // Constructs a capture object from the text of the capture and its starting index
    // as a codepoint offset from the start of the string. This is a private constructor
    // intended to be called from the Pattern class as there should be no need for the user
    // to construct Capture objects directly.
    construct new_(text, index) {
        if (!(text is String)) Fiber.abort("Capture text must be a string.")
        if (!((index is Num) && index.isInteger && index >= 0)) {
            Fiber.abort("Capture index must be a non-negative integer.")
        }
        _text = text
        _index = index
    }

    // Properties.
    text     { _text }                        //  the text of the capture
    index    { _index }                       //  its starting index (codepoints)
    length   { _text.count }                  //  its length
    span     { [_index, index + length - 1] } //  a list of its starting and ending indices

    // String representation.
    toString { "{ text = %(_text), index = %(_index), length = %(length) }" }
}

/* Pattern represents a pattern to be used for matching characters within a string.
   A Pattern object is immutable.
*/
class Pattern {
    // Constant pattern types.
    static within { 0 }   // matches anywhere within a string
    static start  { 1 }   // matches only at the start of a string
    static end    { 2 }   // matches only at the end of a string
    static whole  { 3 }   // matches the whole of a string

    static types { ["within", "start", "end", "whole"] }

    // Constants to help construct user-defined patterns.
    static lower  { "abcdefghijklmnopqrstuvwxyz" }
    static upper  { "ABCDEFGHIJKLMNOPQRSTUVWXYZ" }
    static letter { lower + upper }
    static digit  { "0123456789" }
    static alpha  { letter + digit }

    // Private method to initialize function tables and back-reference symbols.
    static init_() {
        // character classes
        __fns = [
            Fn.new { |c| (c >= 65 && c <= 90) || (c >= 97 && c <= 122) },                         // a
            Fn.new { |c|  c == 48 || c == 49 },                                                   // b
            Fn.new { |c|  c <  32 || c == 127 },                                                  // c
            Fn.new { |c|  c >= 48 && c <= 57 },                                                   // d
            Fn.new { |c| (c >= 48 && c <= 57) || ".+-Ee".codePoints.contains(c) },                // e
            Fn.new { |c| (c >= 48 && c <= 57) || c == 46 },                                       // f
            Fn.new { |c|  c >= 33 && c < 127 },                                                   // g
            Fn.new { |c| (c >= 48 && c <= 57) || (c >= 65 && c <= 70) || (c >= 97 && c <= 102) }, // h
            Fn.new { |c, p|  p.i.codePoints.contains(c) },                                        // i
            Fn.new { |c, p|  p.j.codePoints.contains(c) },                                        // j
            Fn.new { |c, p|  p.k.codePoints.contains(c) },                                        // k
            Fn.new { |c|  c >= 97 && c <= 122 },                                                  // l
            Fn.new { |c| (c >= 97 && c <= 122) || (c >= 48 && c <= 57) },                         // m
            Fn.new { |c|  c == 43 || c == 45 },                                                   // n
            Fn.new { |c|  c >= 48 && c <= 55 },                                                   // o
            Fn.new { |c| (c >= 33 && c < 127) && !__fns[22].call(c) },                            // p
            Fn.new { |c|  c == 34 || c == 39 || c == 96 },                                        // q
            Fn.new { |c|  c < 128 },                                                              // r
            Fn.new { |c|  c == 32 || (c >= 9 && c <= 13) },                                       // s
            Fn.new { |c| ((c >= 9 && c <= 13) || (c >= 32 && c < 127)) && !__fns[22].call(c) },   // t
            Fn.new { |c|  c >= 65 && c <= 90 },                                                   // u
            Fn.new { |c| (c >= 65 && c <= 90) || (c >= 48 && c <= 57) },                          // v
            Fn.new { |c| (c >= 48 && c <= 57) || (c >= 65 && c <= 90) || (c >= 97 && c <= 122) }, // w
            Fn.new { |c|  __fns[22].call(c) || c == 95 },                                         // x
            Fn.new { |c|  __fns[22].call(c) || c == 95 || c == 39 || c == 45 },                   // y
            Fn.new { |c|  true }                                                                  // z
        ]
                                                          
        // extended classes
        __fns2 = [
            Fn.new { |c|  __fns2[11].call(c) || __fns2[20].call(c) },                               // a
            Fn.new { |c|  c == 48 || c == 49 },                                                     // b
            Fn.new { |c|  c <  32 || (c >=  127 && c < 160) },                                      // c
            Fn.new { |c|  c >= 48 && c <= 57 },                                                     // d
            Fn.new { |c| (c >= 48 && c <= 57) || ".+-Ee".codePoints.contains(c) },                  // e
            Fn.new { |c| (c >= 48 && c <= 57) || c == 46 },                                         // f
            Fn.new { |c| (c >= 33 && c < 127) || (c >= 161 && c <= 255) },                          // g
            Fn.new { |c| (c >= 48 && c <= 57) || (c >= 65 && c <= 70) || (c >= 97 && c <= 102) },   // h
            Fn.new { |c, p|  p.i.codePoints.contains(c) },                                          // i
            Fn.new { |c, p|  p.j.codePoints.contains(c) },                                          // j
            Fn.new { |c, p|  p.k.codePoints.contains(c) },                                          // k
            Fn.new { |c| (c >= 97 && c <= 122) || c == 181 || (c >= 223 && c <= 255 && c != 247) }, // l  
            Fn.new { |c|  __fns2[11].call(c) || (c >= 48 && c <= 57) },                             // m
            Fn.new { |c|  c == 43 || c == 45 || c == 177 },                                         // n
            Fn.new { |c|  c >= 48 && c <= 55 },                                                     // o
            Fn.new { |c|  __fns2[6].call(c) && !__fns2[22].call(c) },                               // p
            Fn.new { |c|  c == 34 || c == 39 || c == 96 || c == 171 || c == 187 },                  // q
            Fn.new { |c|  c < 256 },                                                                // r
            Fn.new { |c|  c == 32 || (c >= 9 && c <= 13) || c == 160 },                             // s
            Fn.new { |c|  __fns2[15].call || __fns2[18].call },                                     // t
            Fn.new { |c| (c >= 65 && c <= 90) || (c >= 192 && c <= 222 && c != 215) },              // u
            Fn.new { |c|  __fns2[20].call(c) || (c >= 48 && c <= 57) },                             // v
            Fn.new { |c| (c >= 48 && c <= 57) || __fns2[0].call(c) },                               // w
            Fn.new { |c|  __fns2[22].call(c) || c == 95 },                                          // x
            Fn.new { |c|  __fns2[22].call(c) || c == 95 || c == 39 || c == 45 || c == 173 },        // y
            Fn.new { |c|  true }                                                                    // z
        ]

        // back reference symbols
        __backRefs = ["$1", "$2", "$3", "$4", "$5", "$6", "$7", "$8", "$9"]
    }

    // Returns a list of the text properties of each match in matches.
    static matchesText(matches) { matches.map { |m| m.text }.toList }

    // Returns whether a pattern string is valid or not.
    static validate(pattern) {
        if (!((pattern is String) && pattern != "")) return false
        return !Fiber.new {
            validate_(pattern)
        }.try()
    }

    // Private worker method to validate and tokenize a pattern and get its minimum matching length.
    static validate_(pattern) {
        var min = 0                        // minimum length
        var pc = pattern.codePoints.toList // pattern codepoints
        var lpc = pc.count                 // pattern length
        var i = 0                          // codepoint index
        var cap = false                    // whether within a capture
        var captures = []                  // stores min length for each capture
        var curMin = 0                     // minimum length of current mini-pattern
        var capMin = 0                     // minimum length of current capture
        var c = 0                          // current codepoint
        var tokens = []                    // tokenize pattern to make subsequent matching easier

        // Increments min or curMin.
        var increment = Fn.new { |imin|
            if (!cap) {
                min = min + imin
            } else {
                curMin = curMin + imin
            }
        }

        // Handles the slash or ampersand metacharacters.
        var slashOrAmp = Fn.new { |reps|
            i = i + 1
            if (i == lpc) Fiber.abort("Invalid pattern - missing character at index %(i).")
            var d = pc[i]
            if (reps > 0) increment.call(reps)
            if (d >= 97 && d <= 122) {
                tokens.add(-c)
                tokens.add(d - 97)
            } else if (d >= 65 && d <= 89) {
                tokens.add(-c)
                tokens.add(d - 39)
            } else {
                tokens.add(d)
            }
        }

        // Handles the caret or at sign metacharacters.
        var caretOrAt = Fn.new { |reps|
            i = i + 1
            if (i == lpc) Fiber.abort("Invalid pattern - missing character at index %(i).")
            if (reps > 0) increment.call(reps)
            var d = pc[i]
            if (c == 94) {
                tokens.add(-c)
                tokens.add(d)
            } else {
                if (__fns2[20].call(d)) {
                    tokens.add(-c)
                    tokens.add(d)
                    tokens.add(d + 32)
                } else if (__fns2[11].call(d) && d != 181 && d != 223 && d != 255) {
                    tokens.add(-c)
                    tokens.add(d)
                    tokens.add(d - 32)
                } else {
                    tokens.add(d)
                }
            }
        }

        while (i < lpc) {
            c = pc[i]                          // current codepoint
            if (c == 47 || c == 38) {          // slash = character class, ampersand = extended class
                slashOrAmp.call(1)
            } else if (c == 94 || c == 64 ) {  // caret = complement, at sign = either-case
                caretOrAt.call(1)
            } else if (c == 61 || c == 43 || c == 35) { // multiple, minimum or range
                i = i + 1
                if (i == lpc) Fiber.abort("Invalid pattern - missing digit at index %(i).")
                var d = pc[i]                  // get the next codepoint
                if (d < 48 || d > 57) Fiber.abort("Invalid pattern - non-digit found at index %(i).")
                tokens.add(-c)
                var reps = d - 48
                tokens.add(reps)
                if (c == 61) {        // equals sign = multiple
                    if (reps < 2) {
                        reps = reps + 10
                        tokens[-1] = reps
                    }
                } else if (c == 35) { // hash = range
                    if (reps == 9) {
                        Fiber.abort("Invalid pattern - first digit cannot exceed eight at index %(i).")
                    }
                    i = i + 1
                    if (i == lpc) Fiber.abort("Invalid pattern - missing second digit at index %(i).")
                    var e = pc[i]
                    if (e < 48 || e > 57) Fiber.abort("Invalid pattern - non-digit found at index %(i).")
                    if (e <= d) {
                        Fiber.abort("Invalid pattern - seocond digit must be greater than first at index %(i).")
                    }
                    tokens.add(e - 48)
                }
                i = i + 1
                if (i == pc.count) {
                    Fiber.abort("Invalid pattern - missing 'single' at index %(i).")
                }
                c = pc[i]   // get the next codepoint
                if (c == 47 || c == 38) {
                    slashOrAmp.call(reps)
                } else if (c == 94 || c == 64) {
                    caretOrAt.call(reps)
                } else if ("=+#~[]|$".codePoints.contains(c)) {
                    Fiber.abort("Invalid pattern - missing 'single' at index %(i).")
                } else {
                    increment.call(reps)
                    tokens.add(c)
                }
            } else if (c == 126) { // tilde == optional
                i = i + 1
                if (i == lpc) Fiber.abort("Invalid pattern - missing 'single' at index %(i).")
                tokens.add(-35) // use range for tokenization purposes
                tokens.add(0)
                tokens.add(1)
                c = pc[i]  // get the next codepoint
                if (c == 47 || c == 38) {
                    slashOrAmp.call(0)
                } else if (c == 94 || c == 64) {
                    caretOrAt.call(0)
                } else if ("=+#~[]|$".codePoints.contains(c)) {
                    Fiber.abort("Invalid pattern - missing 'single' at index %(i).")
                } else {
                    tokens.add(c)
                }
            } else if (c == 91) { // left square bracket = capture opening
                if (cap) Fiber.abort("Invalid pattern - orphan [ found.")
                cap = true
                curMin = 0
                capMin = Num.largest
                tokens.add(-91)
            } else if (c == 124) { // vertical bar = end of mini-pattern
                if (!cap) Fiber.abort("Invalid pattern - orphan | found.")
                if (curMin < capMin) capMin = curMin
                curMin = 0
                tokens.add(-124)
            } else if (c == 93) { // right square bracket = capture end
                if (!cap) Fiber.abort("Invalid pattern - orphan ] found.")
                if (curMin < capMin) capMin = curMin
                cap = false
                increment.call(capMin)
                captures.add(capMin)
                tokens.add(-93)
            } else if (c == 36) { // dollar sign = back-reference
                i = i + 1
                if (i == lpc) Fiber.abort("Invalid pattern - missing digit at index %(i).")
                c = pc[i]       // get the next codepoint
                if (c < 48 || c > 57) Fiber.abort("Invalid pattern - non-digit found at index %(i).")
                c = c - 48
                if (c == 0) {
                    increment.call(min)
                } else if (c > captures.count) {
                    Fiber.abort("Invalid pattern - back-reference exceeds capture count at %(i).")
                } else {
                    increment.call(captures[c-1])
                }
                tokens.add(-36)
                tokens.add(c)
            } else { // normal character
                increment.call(1)
                tokens.add(c)
            }
            i = i + 1
        }
        if (cap) Fiber.abort("Invalid pattern - capture unfinished at %(i).")
        return [min, tokens]
    }

    // Private worker method.
    // Looks for a pattern match from a list of codepoints, 'codes', starting from index 'start'.
    // 'sc' is the number of codepoints in 'codes'.
    // Returns a Match object if a match is found or null otherwise.
    match_(codes, start, sc) {
        var tokens = _tokens.toList     // use a copy as we might change it
        var tc = tokens.count           // tokens length
        var ti = 0                      // tokens index
        var t = tokens[ti]              // current token
        var si = start                  // string codepoints index        
        var c = -1                      // string current codepoint
        var consumed = false            // whether current codepoint has been consumed
        var cap = false                 // whether within a capture
        var captures = []               // stores captures
        var wm = ""                     // matched so far in string as a whole
        var cm = ""                     // matched so far in current capture
        var ci = 0                      // string index at which capture started

        if (si < sc) c = codes[si]

        // Consume current character.
        var consume = Fn.new {
            if (!cap) {
                wm = wm + String.fromCodePoint(c)
            } else {
                cm = cm + String.fromCodePoint(c)
            }
            consumed = true
        }

        // Moves token index, where necessary, to next metacharacter
        var moveTokenIndex = Fn.new { |z|
             if (z == -47 || z == -38 || z == -94) {
                ti = ti + 1
             } else if (z == -64) {
                ti = ti + 2
             }
        }

        // Checks if there's another mini-pattern in the current capture and if so prepares to match it.
        var nextMiniPattern = Fn.new {
            while (true) {
                ti = ti + 1
                t = tokens[ti]
                if (t == -93) return false // end of capture
                if (t == -124) {
                    cm = ""
                    si = ci
                    c = codes[si]
                    break
                }
            }
            return true
        }

        // Checks that there are no more options to consider before declaring a non-match.
        var noMore = Fn.new { !cap || !nextMiniPattern.call() }
 
        // Checks if character class matches and if so consumes character.
        var slash = Fn.new { |inc|
            if (inc) ti = ti + 1
            var u = tokens[ti]
            if (u >= 8 && u <= 10) {
                if (!__fns[u].call(c, this)) return false
            } else if (u < 26) {
                if (!__fns[u].call(c)) return false
            } else if (u >= 34 && u <= 36) {
                if (__fns[u-26].call(c, this)) return false
            } else {
                if (__fns[u-26].call(c)) return false
            }
            consume.call()
            return true
        }

        // Checks if extended class matches and if so consumes character.
        var ampersand = Fn.new { |inc|
            if (inc) ti = ti + 1
            var u = tokens[ti]
            if (u >= 8 && u <= 10) {
                if (!__fns2[u].call(c, this)) return false
            } else if (u < 26) {
                if (!__fns2[u].call(c)) return false
            } else if (u >= 34 && u <= 36) {
                if (__fns2[u-26].call(c, this)) return false
            } else {
                if (__fns2[u-26].call(c)) return false
            }
            consume.call()
            return true
        }

        // Checks if complement matches and if so consumes character.
        var caret = Fn.new { |inc|
            if (inc) ti = ti + 1
            var u = tokens[ti]
            if (c == u) return false
            consume.call()
            return true
        }

        // Checks if either-case matches and if so consumes character.
        var at = Fn.new { |inc|
            var u
            var v
            if (inc) {
                u = tokens[ti + 1]
                v = tokens[ti + 2]
                ti = ti + 2
            } else {
                u = tokens[ti-1]
                v = tokens[ti]
            }
            if (u > v) { // lower case first
                if (c != u && c != v) return false
            } else {
                if (c == u || c == v) return false
            }
            consume.call()
            return true
        }

        // Checks if ordinary character matches and if so consumes character.
        var character = Fn.new { |z|
            if (c != z) return false
            consume.call()
            return true
        }
         
        while (true) {
            for (i in 1..1) { // dummy loop so break can emulate goto
                if (t == -47) {  // slash = character class
                    if (si == sc) if (noMore.call()) return null else break
                    if (!slash.call(true) && noMore.call()) return null
                } else if (t == -38) { // ampersand = extended class
                    if (si == sc) if (noMore.call()) return null else break
                    if (!ampersand.call(true) && noMore.call()) return null
                } else if (t == -94) { // caret = complement
                    if (si == sc) if (noMore.call()) return null else break
                    if (!caret.call(true) && noMore.call()) return null
                } else if (t == -64) { // at sign = either-case
                    if (si == sc) if (noMore.call) return null else break
                    if (!at.call(true) && noMore.call()) return null
                } else if (t == -61 || t == -43 || t == -35) { // quantifier
                    ti = ti + 1
                    var required = tokens[ti]
                    var reps
                    if (t == -61) {        // equals sign = multiple
                        reps = required
                    } else if (t == -43) { // plus sign = minimum
                        reps = Num.largest
                    } else if (t == -35) { // hash sign = range
                        ti = ti + 1
                        reps = tokens[ti]
                    }
                    ti = ti + 1
                    var z = tokens[ti]
                    if (si == sc) {
                        if (required > 0) {
                            if (noMore.call()) return null else break
                        } else {
                            moveTokenIndex.call(z)
                            break
                        }
                    }                  
                    if (z >= 0) { // ordinary character
                        for (i in 1..reps) {
                            if (!character.call(z)) {
                                if (i <= required && noMore.call()) return null
                                break
                            }
                            if (i == reps) break
                            si = si + 1
                            consumed = false
                            if (si == sc) {
                                if (i < required && noMore.call()) return null          
                                break
                            }
                            c = codes[si]
                        }
                    } else if (z == -47) { // character class
                        for (i in 1..reps) {
                            if (!slash.call(i == 1)) {
                                if (i <= required && noMore.call()) return null
                                break
                            }
                            if (i == reps) break
                            si = si + 1
                            consumed = false
                            if (si == sc) {
                                if (i < required && noMore.call()) return null         
                                break
                            }
                            c = codes[si]
                        }
                    } else if (z == -38) { // extended class
                        for (i in 1..reps) {
                            if (!ampersand.call(i == 1)) {
                                if (i <= required && noMore.call()) return null
                                break
                            }
                            if (i == reps) break
                            si = si + 1
                            consumed = false
                            if (si == sc) {
                                if (i < required && noMore.call()) return null
                                break
                            }
                            c = codes[si]
                        }
                    } else if (z == -94) { // complement
                        for (i in 1..reps) {
                            if (!caret.call(i == 1)) {
                                if (i <= required && noMore.call()) return null
                                break
                            }
                            if (i == reps) break
                            si = si + 1
                            consumed = false
                            if (si == sc) {
                                if (i < required && noMore.call()) return null           
                                break
                            }
                            c = codes[si]
                        }
                    } else if (z == -64) { // either-case
                        for (i in 1..reps) {
                            if (!at.call(i == 1)) {
                                if (i <= required && noMore.call()) return null
                                break
                            }
                            if (i == reps) break
                            si = si + 1
                            consumed = false
                            if (si == sc) {
                                if (i < required && noMore.call()) return null
                                break
                            }
                            c = codes[si]
                        }
                    }               
                } else if (t == -91) { // capture opening
                    cap = true
                    cm = ""
                    ci = si
                } else if (t == -124) { // end of mini-pattern
                    captures.add(Capture.new_(cm, ci))
                    wm = wm + cm
                    cap = false
                    while (true) { // find capture end
                        ti = ti + 1
                        t = tokens[ti]
                        if (t == -93) break
                    }
                } else if (t == -93) { // capture end
                    captures.add(Capture.new_(cm, ci))
                    wm = wm + cm
                    cap = false
                } else if (t == -36) { // back-reference
                    ti = ti + 1
                    var cn = tokens[ti]
                    var text = (cn > 0) ? captures[cn-1].text : wm
                    if (si == sc && text.Count > 0 && noMore.call()) return null
                    var tokens1 = tokens[0..ti]
                    var tokens2 = tokens[ti+1..-1]
                    tokens = tokens1 + text.codePoints.toList + tokens2
                    tc = tokens.count
                } else { // ordinary character
                    if (si == sc) if (noMore.call()) return null else break
                    if (!character.call(t) && noMore.call()) return null
                }
            } // end for loop
            ti = ti + 1
            if (ti == tc) break
            t = tokens[ti]
            if (consumed) {
                si = si + 1
                consumed = false
                if (si < sc) {
                    c = codes[si]
                }
            }
        }

        return Match.new_(wm, start, captures)
    }

    // Converts any metacharacters in 'p' to their literal equivalents.
    static escape(p) {
        var mcs = "/&^@=+#~[]|$"
        for (mc in mcs) {
           p = p.replace(mc, "/%(mc)")
        }
        return p
    }

    // Constructs a Pattern object from a pattern, its type and its user defined character
    // classes. If an empty string is passed for the latter, they use their defaults.      
    construct new(pattern, type, i, j, k) {
        if (!((pattern is String) && pattern != "")) {
            Fiber.abort("Pattern must be a non-empty string.")
        }
        var mt = Pattern.validate_(pattern)
        _minLen = mt[0]
        _tokens = mt[1]
        _pattern = pattern
        if (!((type is Num) && type.isInteger && type >= 0 &&  type <= 3)) {
             Fiber.abort("Pattern type must be an integer between 0 and 3 inclusive.")
        }
        _type = type
        if (!((i is String) && (j is String) && (k is String))) {
            Fiber.abort("Used defined class must be a string.")
        }
        _i = (i != "") ? i : "012"
        _j = (j != "") ? j : "0123"
        _k = (k != "") ? k : "01234"
    }

    // Convenience methods which call the constructor with default values for some arguments.
    static new(pattern, type, i, j) { new(pattern, type,  i,  j, "") }
    static new(pattern, type, i)    { new(pattern, type,  i, "", "") }
    static new(pattern, type)       { new(pattern, type, "", "", "") }
    static new(pattern)             { new(pattern,    0, "", "", "") }

    // Properties.
    pattern  { _pattern }   // the pattern string
    type     { _type    }   // its type
    minLen   { _minLen  }   // its minimum matching length (possibly zero)
    i        { _i }         // the user defined character class represented by /i
    j        { _j }         // the user defined character class represented by /j
    k        { _k }         // the user defined character class represented by /k

    // Checks whether the pattern matches a string or not.
    isMatch(s) { find(s) != null }

    // Finds and returns the first match (as a Match object) or null if there are no matches.
    find(s) {
        if (!(s is String)) Fiber.abort("Argument must be a string.")
        var codes = s.codePoints.toList
        var sc = codes.count
        if (sc < _minLen) return null
        if (_type == Pattern.within) {
            var maxStart = sc - _minLen
            for (start in 0..maxStart) {
                var m = match_(codes, start, sc)
                if (m) return m
            }
            return null
        }
        if (_type == Pattern.start) return match_(codes, 0, sc)
        if (_type == Pattern.end) {
            var maxStart = sc - _minLen
            for (start in 0..maxStart) {
                var m = match_(codes, start, sc)
                if (m && ((start + m.length) == sc)) return m
            }
            return null
        }
        if (_type == Pattern.whole) {
            var m = match_(codes, 0, sc)
            if (!m || m.length < sc) return null
            return m
        }
    }

    // Finds and returns all successive non-overlapping matches, if there are any,
    // as a list of Match objects. The list will be empty if there are no matches.
    // To prevent infinite recursion, it stops at (but includes) the first empty match.
    // Note that apart from Pattern.within there can never be more than one match.
    findAll(s) {
        var m = find(s)
        if (!_type == Pattern.within) {
            return (m) ? [m] : []
        }
        if (!m) return []
        var matches = [m]
        if (m.length == 0) return matches
        var codes = s.codePoints.toList
        var sc = codes.count
        var start = m.index + m.length
        while (start + _minLen <= sc) {
            m = match_(codes, start, sc)
            if (m) {
                matches.add(m)
                if (m.length == 0) break
                start = start + m.length
            } else {
                start = start + 1
            }
        }
        return matches
    }

    // As the 'find' method but can simulate lazy matching by treating '/Z' within the pattern
    // as matching any characters other than the string of literal characters 't'.
    // Should not be used if 's' might contain the SO (shift out) character '0x0e'.
    findLazy(s, t) {
        var SO = "\x0e"
        var rep = SO * t.count
        s = s.replace(t, rep)
        var pattern2 = _pattern.replace("/Z", "^%(SO)").replace(Pattern.escape(t), rep)
        var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
        var m = p2.find(s)
        if (!m) return null
        var captures2 = []
        for (c in m.captures) {
            captures2.add(Capture.new_(c.text.replace(rep, t), c.index))
        }
        var text2 = m.text.replace(rep, t)
        return Match.new_(text2, m.index, captures2)
    }

    // As the 'find' method but can simulate lazy matching by treating '/Z' within the pattern
    // as matching any characters other than the string of literal characters 't' and '&Z' within
    // the pattern as matching any characters other than the string of literal characters 'u'.
    // Should not be used if 's' might contain the SO (shift out) character '0x0e' or the
    // SI (shift in) character '0x0f'.
    findLazy2(s, t, u) {
        var SO = "\x0e"
        var SI = "\x0f"
        var rep1 = SO * t.count
        var rep2 = SI * u.count
        s = s.replace(t, rep1).replace(u, rep2)
        var pattern2 = _pattern.replace("/Z", "^%(SO)").replace(Pattern.escape(t), rep1)
                               .replace("&Z", "^%(SI)").replace(Pattern.escape(u), rep2)
        var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
        var m = p2.find(s)
        if (!m) return null
        var captures2 = []
        for (c in m.captures) {
            captures2.add(Capture.new_(c.text.replace(rep1, t).replace(rep2, u), c.index))
        }
        var text2 = m.text.replace(rep1, t).replace(rep2, u)
        return Match.new_(text2, m.index, captures2)
    }

    // As the 'find' method but can simulate quantified group matching by treating '/Z' within the pattern
    // as matching the string of literal characters 't'.
    // Should not be used if 's' might contain the SO (shift out) character '0x0e'.
    findWithGroup(s, t) {
        var SO = "\x0e"
        s = s.replace(t, SO)
        var indexMap = List.filled(s.count, 0)
        var i = 0
        var j = 0
        var d = t.count - 1
        for (c in s) {
            indexMap[i] = j
            if (c == SO) j = j + d
            i = i + 1
            j = j + 1
        }
        var pattern2 = _pattern.replace("/Z", SO).replace(Pattern.escape(t), SO)
        var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
        var m = p2.find(s)
        if (!m) return null
        var captures2 = []
        for (c in m.captures) {
            captures2.add(Capture.new_(c.text.replace(SO, t), indexMap[c.index]))
        }
        var text2 = m.text.replace(SO, t)
        return Match.new_(text2, indexMap[m.index], captures2)
    }

    // As the 'find' method but can simulate quantified group matching by treating '/Z' within the pattern
    // as matching the string of literal characters 't' and '&Z' within the pattern as matching
    // the string of literal characters 'u'.
    // Should not be used if 's' might contain the SO (shift out) character '0x0e' or the
    // SI (shift in) character '0x0f'.
    findWithGroup2(s, t, u) {
        var SO = "\x0e"
        var SI = "\x0f"
        s = s.replace(t, SO).replace(u, SI)
        var indexMap = List.filled(s.count, 0)
        var i = 0
        var j = 0
        var d1 = t.count - 1
        var d2 = u.count - 1
        for (c in s) {
            indexMap[i] = j
            if (c == SO) {
                j = j + d1
            } else if (c == SI) {
                j = j + d2
            }
            i = i + 1
            j = j + 1
        }
        var pattern2 = _pattern.replace("/Z", SO).replace(Pattern.escape(t), SO)
                               .replace("&Z", SI).replace(Pattern.escape(u), SI)
        var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
        var m = p2.find(s)
        if (!m) return null
        var captures2 = []
        for (c in m.captures) {
            captures2.add(Capture.new_(c.text.replace(SO, t).replace(SI, u), indexMap[c.index]))
        }
        var text2 = m.text.replace(SO, t).replace(SI, u)
        return Match.new_(text2, indexMap[m.index], captures2)
    }

    // Replaces up to 'n' successive matches in 's', optionally skipping some of those 'n', by the     
    // replacement string 'repl'. If there are no (or not enough) matches, returns 's' itself.
    // If n <= 1, uses all matches as separators.
    replace(s, repl, n, skip) {
        if (!(s is String) || !(repl is String)) Fiber.abort("First two arguments must be strings.")
        if (!((n is Num) && n.isInteger)) Fiber.abort("Third argument must be an integer.")
        if (!((skip is Num) && skip.isInteger && skip >= 0)) {
            Fiber.abort("Fourth argument must be a non-negative integer.")
        }
        var matches = findAll(s)
        var c = matches.count
        if (c == 0) return s
        if (n < 1 || n > c) n = c
        if (n <= skip) return s
        if (skip > 0 || n < c) matches = matches[skip...n]
        var cps = s.codePoints.toList
        var addIndex = 0
        for (m in matches) {
            var caps = m.captures
            var count = 0
            for (br in __backRefs[0...caps.count]) {
                repl = repl.replace(br, caps[count].text)
                count = count + 1
            }
            repl = repl.replace("$0", m.text)
            repl = repl.replace("$$", "$")               
            var s1 = cps[0...addIndex + m.index]
            var s2 = repl.codePoints.toList
            var s3 = cps[addIndex + m.index + m.length..-1]        
            cps = s1 + s2 + s3
            addIndex = addIndex + s2.count - m.length
        }
        return cps.map { |cp| String.fromCodePoint(cp) }.join()
    }

    // Convenience version of the above method which replaces all matches.
    replaceAll(s, repl) { replace(s, repl, 0, 0) }

    // Splits the string into a list of up to 'n+1' substrings using pattern matches as the separators
    // optionally skipping some of those 'n' separators.
    // If there are no matches returns a list with a single element, 's' itself.
    // If n < 1, uses all the matches as separators.
    split(s, n, skip) {
        if (!(s is String)) Fiber.abort("First argument must be a string.")
        if (!((n is Num) && n.isInteger)) Fiber.abort("Second argument must be an integer.")
        if (!((skip is Num) && skip.isInteger && skip >= 0)) {
            Fiber.abort("Third argument must be a non-negative integer.")
        }
        var matches = findAll(s)
        var c = matches.count
        if (c == 0) return [s]
        if (n < 1 || n > c) n = c
        if (n <= skip) return [s]
        if (skip > 0 || n < c) matches = matches[skip...n]
        var cps = s.codePoints.toList
        var splits = []
        var prev = 0
        for (m in matches) {
            var next = m.index
            var item = cps[prev...next]
            splits.add(item.map { |cp| String.fromCodePoint(cp) }.join())
            prev = next + m.length
        }
        splits.add(cps[prev..-1].map { |cp| String.fromCodePoint(cp) }.join())
        return splits
    }

    // Convenience version of the above method which uses all the matches as separators.
    splitAll(s) { split(s, 0, 0) }

    // String representation (excluding user defined character classes).
    toString { "{ pattern = %(_pattern), type = %(Pattern.types[_type]), min length = %(_minLen) }" }
}

Pattern.init_()