Category talk:Wren-pattern

===Wren patterns===
 
Wren doesn't have direct access to a regular expression library (but see the Wren-regex section below) and, even if it did, regular expressions would be tedious to use as all meta-characters would need to be ''double escaped'', the language not supporting ''raw'' strings either. Moreover, writing such a library from scratch in a small scripting language such as Wren may not be a viable proposition due to its likely size and performance limitations.
 
I have therefore designed and coded a simple pattern matcher instead, the rules of which are described below. Obviously, this has nothing like the full power of regular expressions but I hope it will nevertheless be a useful addition to the armory.
Multiples, minima, ranges and optionals are referred to as ''quantifiers'' as they specify how many times a ''single'' needs to be repeated for a match to occur.
 
Note that quantifiers are always ''greedy''; they match as many times as they can, never take into account whether the next symbol in the pattern would match and never backtrack to try and change an unsuccessful match into a successful one. Clearly, this needs to be borne in mind when using them in a pattern. However, there is a way to make them act ''lazily'' which will be discussed later.
 
Captures match a group of 0 or more singles and may include alternatives which are considered in turn until a match occurs. Finally, back-references refer to text matched by a previous capture.
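
For illustration, here is a minimal sketch of both behaviours. It assumes the module below is saved as "pattern.wren" alongside the script and that Pattern.new can be called with just the pattern string; '+0' denotes a minimum of zero repetitions, as used in the lazy-matching example further down.

<syntaxhighlight lang="wren">
import "./pattern" for Pattern

// Greedy, non-backtracking: "+0a" consumes all three a's, so the literal
// "ab" which follows has nothing left to match and the whole pattern fails,
// even though a backtracking engine would have found a match.
var p1 = Pattern.new("[+0a]ab")
System.print(p1.find("aaab") == null)  // expect: true

// A capture with two alternatives, then a back-reference to it.
var p2 = Pattern.new("[cat|dog] eats $1 food")
var m = p2.find("dog eats dog food")
System.print(m.text)              // expect: dog eats dog food
System.print(m.captures[0].text)  // expect: dog
</syntaxhighlight>
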
Although the ''standard'' character classes should suffice for most purposes, the user can redefine up to three of them (i, j and k) for each pattern to deal with special cases, such as a limited range of letters or digits.
 
An upper case character class represents the complement of the lower case version. For example /A matches any character other than a-z or A-Z, including non-ASCII characters. Note that /Z normally just matches Z itself as it's not possible, of course, to match no characters. However, it does have a special meaning for 'lazy' and 'quantified group' matches which will be covered later.
 
/ followed by any character other than a letter just matches the character itself. This allows the 12 meta-characters to be treated as ''literal'' characters without their special meaning. So /| is a literal vertical bar.
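
For example (a hedged sketch, with the same assumptions as above about importing the module and the single-argument constructor), together with the Pattern.escape helper from the source code below, which escapes a whole string at once:

<syntaxhighlight lang="wren">
import "./pattern" for Pattern

// "/|" is a literal vertical bar, so this pattern matches the three characters a, |, b.
var p = Pattern.new("a/|b")
System.print(p.find("a|b").text)  // expect: a|b

// Pattern.escape prefixes every meta-character with "/".
System.print(Pattern.escape("$5 + $6"))  // expect: /$5 /+ /$6
</syntaxhighlight>
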
The upper case version of an extended class represents the complement of the lower case version. For example &N matches any character other than a sign.
 
& followed by any other letter or character behaves exactly the same as if it were preceded by / except that &Z has a special meaning for 'lazy' and 'quantified group' matches which will be covered later.
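
For example (a hedged sketch, same import and constructor assumptions as above):

<syntaxhighlight lang="wren">
import "./pattern" for Pattern

// Outside its class letters, "&" escapes just like "/", so "&~" is a literal tilde.
var p = Pattern.new("a&~b")
System.print(p.find("a~b").text)  // expect: a~b
</syntaxhighlight>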
 
;Complements
 
These are patterns which are used in the appropriate methods as replacements for a matching pattern. They are treated as normal text except that they can contain back-references ($0 always refers to the whole match) and a literal $ must be escaped with $$ (not the usual /$).
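
The names of the replacement methods are not shown in this excerpt, so the following is only a hypothetical sketch: it assumes the full module has a replaceAll(s, r) method and that the default match type finds matches anywhere within the string.

<syntaxhighlight lang="wren">
import "./pattern" for Pattern

// Hypothetical: replaceAll(s, r) is assumed, not shown in the excerpted source.
var p = Pattern.new("[cat|dog]")
System.print(p.replaceAll("cat meets dog", "$0s"))   // expect: cats meets dogs
System.print(p.replaceAll("cat meets dog", "$$$0"))  // expect: $cat meets $dog
</syntaxhighlight>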
 
;Examples
 
"The quick brown [fox|c/l/l]" matches: The quick brown fox|cat|cow.
 
;Lazy matching
 
The pattern "<a>[+0/z]<//a>" fails to match the string "<a>b < c</a>" because +0/z is 'greedy' and matches all characters to the end of the string including the final "</a>".
 
If you change the pattern to "<a>[+0^<]<//a>" it still fails to match the string because of the presence of '<' in the text between the tags.
 
However, if you change the pattern to "<a>[+0/Z]<//a>" and use the 'findLazy' method with a 't' parameter of "</a>", then a match will occur and the captured text will be 'b < c'. What we did here was to replace '/z' or '^<' in the pattern with '/Z' which, in a lazy method, means match any characters other than the parameter string 't'. In the non-lazy 'find' method '/Z' would just match a literal 'Z' as noted earlier.
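
As a code sketch of this example (assuming the module is imported as before and that Pattern.new can take just the pattern string):

<syntaxhighlight lang="wren">
import "./pattern" for Pattern

var p = Pattern.new("<a>[+0/Z]<//a>")
var m = p.findLazy("<a>b < c</a>", "</a>")
System.print(m.text)              // expect: <a>b < c</a>
System.print(m.captures[0].text)  // expect: b < c
</syntaxhighlight>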
 
The 'findLazy2' method works in a similar fashion but can carry out 2 lazy matches by using '/Z' for the first and '&Z' for the second which match any characters other than the parameter strings 't' and 'u' respectively.
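
A corresponding sketch for 'findLazy2', using '/Z' to stop at "</a>" and '&Z' to stop at "</b>":

<syntaxhighlight lang="wren">
import "./pattern" for Pattern

var p = Pattern.new("<a>[+0/Z]<//a><b>[+0&Z]<//b>")
var m = p.findLazy2("<a>x</a><b>y < z</b>", "</a>", "</b>")
System.print(m.captures[0].text)  // expect: x
System.print(m.captures[1].text)  // expect: y < z
</syntaxhighlight>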
 
These methods do not actually change the 'greedy' nature of the engine but use a hack (replacing text with rarely used control characters and back again after matching) to simulate lazy matching to a limited extent.
 
;Quantification of groups of characters
 
Quantifiers always qualify ''singles''. Any departure from this is an error and groups of characters cannot therefore be directly quantified.
 
There may sometimes be ways to quantify them indirectly either by simply repeating the pattern or by using captures and back-references. For example "[abab|ab|]" would match 'ab' repeated 2, 1 or 0 times and "[abcd]$1$1" would match 'abcd' repeated exactly three times. However, this sort of approach clearly has its limitations.
 
A different and usually better approach is to use the 'findWithGroup' or 'findWithGroup2' methods. These work in an analogous way to the 'findLazy' and 'findLazy2' methods. However, this time '/Z' and '&Z' respectively match the parameter strings 't' and 'u' themselves rather than any characters other than these strings and we can therefore quantify them as though they were 'singles'.
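
For example (same import and constructor assumptions as earlier), '/Z' stands for the group 'ab' and can be quantified as though it were a single:

<syntaxhighlight lang="wren">
import "./pattern" for Pattern

var p = Pattern.new("x[+0/Z]y")
var m = p.findWithGroup("xababy", "ab")
System.print(m.text)              // expect: xababy
System.print(m.captures[0].text)  // expect: abab
</syntaxhighlight>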
 
;Wren-regex
Since this module was written, I've created a Wren-regex module which wraps Go's 'regexp' package. Whilst this addresses some of Wren-pattern's shortcomings, it requires a special Go executable and therefore doesn't work with Wren-cli.
 
===Source code===
<syntaxhighlight lang="wren">/* Module "pattern.wren" */
 
/* Match represents a single successful match made by methods in the Pattern class. ... */

// ...
 
    return Match.new_(wm, start, captures)
  }

  // Converts any metacharacters in 'p' to their literal equivalents.
  static escape(p) {
    var mcs = "/&^@=+#~[]|$"
    for (mc in mcs) {
      p = p.replace(mc, "/%(mc)")
    }
    return p
  }
 
  // ...
    }
    return matches
  }
 
  // As the 'find' method but can simulate lazy matching by treating '/Z' within the pattern
  // as matching any characters other than the string of literal characters 't'.
  // Should not be used if 's' might contain the SO (shift out) character '0x0e'.
  findLazy(s, t) {
    var SO = "\x0e"
    var rep = SO * t.count
    s = s.replace(t, rep)
    var pattern2 = _pattern.replace("/Z", "^%(SO)").replace(Pattern.escape(t), rep)
    var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
    var m = p2.find(s)
    if (!m) return null
    var captures2 = []
    for (c in m.captures) {
      captures2.add(Capture.new_(c.text.replace(rep, t), c.index))
    }
    var text2 = m.text.replace(rep, t)
    return Match.new_(text2, m.index, captures2)
  }
 
  // As the 'find' method but can simulate lazy matching by treating '/Z' within the pattern
  // as matching any characters other than the string of literal characters 't' and '&Z' within
  // the pattern as matching any characters other than the string of literal characters 'u'.
  // Should not be used if 's' might contain the SO (shift out) character '0x0e' or the
  // SI (shift in) character '0x0f'.
  findLazy2(s, t, u) {
    var SO = "\x0e"
    var SI = "\x0f"
    var rep1 = SO * t.count
    var rep2 = SI * u.count
    s = s.replace(t, rep1).replace(u, rep2)
    var pattern2 = _pattern.replace("/Z", "^%(SO)").replace(Pattern.escape(t), rep1)
                           .replace("&Z", "^%(SI)").replace(Pattern.escape(u), rep2)
    var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
    var m = p2.find(s)
    if (!m) return null
    var captures2 = []
    for (c in m.captures) {
      captures2.add(Capture.new_(c.text.replace(rep1, t).replace(rep2, u), c.index))
    }
    var text2 = m.text.replace(rep1, t).replace(rep2, u)
    return Match.new_(text2, m.index, captures2)
  }
 
  // As the 'find' method but can simulate quantified group matching by treating '/Z' within the pattern
  // as matching the string of literal characters 't'.
  // Should not be used if 's' might contain the SO (shift out) character '0x0e'.
  findWithGroup(s, t) {
    var SO = "\x0e"
    s = s.replace(t, SO)
    // Maps each index in the substituted string back to its index in the original string.
    var indexMap = List.filled(s.count, 0)
    var i = 0
    var j = 0
    var d = t.count - 1
    for (c in s) {
      indexMap[i] = j
      if (c == SO) j = j + d
      i = i + 1
      j = j + 1
    }
    var pattern2 = _pattern.replace("/Z", SO).replace(Pattern.escape(t), SO)
    var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
    var m = p2.find(s)
    if (!m) return null
    var captures2 = []
    for (c in m.captures) {
      captures2.add(Capture.new_(c.text.replace(SO, t), indexMap[c.index]))
    }
    var text2 = m.text.replace(SO, t)
    return Match.new_(text2, indexMap[m.index], captures2)
  }
 
  // As the 'find' method but can simulate quantified group matching by treating '/Z' within the pattern
  // as matching the string of literal characters 't' and '&Z' within the pattern as matching
  // the string of literal characters 'u'.
  // Should not be used if 's' might contain the SO (shift out) character '0x0e' or the
  // SI (shift in) character '0x0f'.
  findWithGroup2(s, t, u) {
    var SO = "\x0e"
    var SI = "\x0f"
    s = s.replace(t, SO).replace(u, SI)
    // Maps each index in the substituted string back to its index in the original string.
    var indexMap = List.filled(s.count, 0)
    var i = 0
    var j = 0
    var d1 = t.count - 1
    var d2 = u.count - 1
    for (c in s) {
      indexMap[i] = j
      if (c == SO) {
        j = j + d1
      } else if (c == SI) {
        j = j + d2
      }
      i = i + 1
      j = j + 1
    }
    var pattern2 = _pattern.replace("/Z", SO).replace(Pattern.escape(t), SO)
                           .replace("&Z", SI).replace(Pattern.escape(u), SI)
    var p2 = Pattern.new(pattern2, _type, _i, _j, _k)
    var m = p2.find(s)
    if (!m) return null
    var captures2 = []
    for (c in m.captures) {
      captures2.add(Capture.new_(c.text.replace(SO, t).replace(SI, u), indexMap[c.index]))
    }
    var text2 = m.text.replace(SO, t).replace(SI, u)
    return Match.new_(text2, indexMap[m.index], captures2)
  }
 
// ...
}
 
Pattern.init_()</syntaxhighlight>