Tokenize a string with escaping

{{Out}}
<pre>['one|uno', '', 'three^^', 'four^|cuatro', '']</pre>
 
===Regex-based===
 
The Python <code>re</code> module has a handy (though undocumented) class <code>Scanner</code>, which is intended precisely for this use case.
It takes a list of '''(regex, action)''' pairs and, whenever it encounters '''regex''' at the current position in the input, executes '''action'''.
This lets us solve the task very efficiently with minimal effort; the hardest part is defining and ordering the regular expressions correctly, because the patterns are tried in the order given.
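
As a quick illustration of the interface (a toy lexer, not part of the task solution): <code>Scanner.scan</code> returns a pair of the collected action results and the unmatched remainder of the input. Note that <code>\d+</code> must be listed before <code>\w+</code>, since the patterns are tried in order.

<lang python>import re

scanner = re.Scanner([
    (r'\d+', lambda scanner, token: ('NUMBER', int(token))),
    (r'\w+', lambda scanner, token: ('WORD', token)),
    (r'\s+', None),  # a None action matches but emits nothing
])

results, remainder = scanner.scan('ham 42 eggs')
print(results)    # [('WORD', 'ham'), ('NUMBER', 42), ('WORD', 'eggs')]
print(remainder)  # '' -- the whole input was consumed
</lang>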
 
The following code also illustrates an important feature of Python: nested functions with closures.
Thanks to this feature, the inner functions, such as <code>start_new_token</code>, can access the local variable <code>tokens</code> of their enclosing function <code>tokenize</code>. For the inner functions, the name <code>tokens</code> is ''nonlocal'': it lives in the ''enclosing scope'' (as opposed to the parameters <code>scanner</code> and <code>substring</code>, which are in the local scope). Note that the inner functions only mutate the list that <code>tokens</code> refers to; rebinding the name itself would additionally require a <code>nonlocal</code> declaration.
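
A small standalone sketch of this scoping rule (the names <code>make_accumulator</code>, <code>total</code> and <code>items</code> are illustrative only, not part of the task):

<lang python>def make_accumulator():
    total = 0   # immutable value in the enclosing scope
    items = []  # mutable object in the enclosing scope

    def add(x):
        nonlocal total   # rebinding a name requires an explicit declaration...
        total += x
        items.append(x)  # ...but mutating an object in place does not
        return total, items

    return add

acc = make_accumulator()
acc(1)
print(acc(2))  # (3, [1, 2])
</lang>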
 
<lang python>import re

STRING = 'one^|uno||three^^^^|four^^^|^cuatro|'

def tokenize(string=STRING, escape='^', separator='|'):

    escape, separator = map(re.escape, (escape, separator))

    tokens = ['']

    def start_new_token(scanner, substring):
        tokens.append('')

    def add_escaped_char(scanner, substring):
        char = substring[1]
        tokens[-1] += char

    def add_substring(scanner, substring):
        tokens[-1] += substring

    re.Scanner([
        # an escape followed by a character produces that character
        (fr'{escape}.', add_escaped_char),

        # when encountering a separator not preceded by an escape,
        # start a new token
        (fr'{separator}', start_new_token),

        # a sequence of regular characters (i.e. not escape or separator)
        # is just appended to the token
        (fr'[^{escape}{separator}]+', add_substring),
    ]).scan(string)

    return tokens


if __name__ == '__main__':
    print(list(tokenize()))
</lang>
 
Output is the same as in the functional Python version above.
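
Because the escape and separator are parameters (and are passed through <code>re.escape</code> before being interpolated into the patterns), the same tokenizer works with other delimiter choices. A small hypothetical check, not required by the task:

<lang python>print(tokenize('a,b\\,c,', escape='\\', separator=','))
# ['a', 'b,c', '']
</lang>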
 
=={{header|Racket}}==