Brace expansion


Revision as of 15:33, 5 February 2014

Brace expansion is a type of parameter expansion made popular by Unix shells, where it allows users to specify multiple similar string parameters without having to type them all out. For example, the parameter enable_{audio,video} would be interpreted as if both enable_audio and enable_video had been specified.

The Task

Write a function that can perform brace expansion on any input string, according to the following specification.
Demonstrate how it would be used, and that it passes the four test cases given below.

Specification

In the input string, balanced pairs of braces containing comma-separated substrings (details below) represent alternations that specify multiple alternatives which are to appear at that position in the output. In general, one can imagine the information conveyed by the input string as a tree of nested alternations interspersed with literal substrings, as shown in the middle part of the following diagram:

input string	parse ⟶ alternation tree	expand ⟶ output (list of strings)
`It{{em,alic}iz,erat}e{d,}`	`It` ( ( `em` | `alic` ) `iz` | `erat` ) `e` ( `d` | empty )	`Itemized` `Itemize` `Italicized` `Italicize` `Iterated` `Iterate`

(In the tree column, each parenthesized group is an alternation whose branches are separated by |.)

This tree can in turn be transformed into the intended list of output strings by, colloquially speaking, determining all the possible ways to walk through it from left to right while only descending into one branch of each alternation one comes across (see the right part of the diagram). When implementing it, one can of course combine the parsing and expansion into a single algorithm, but this specification discusses them separately for the sake of clarity.
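As a rough illustration of the expansion half only, the alternation tree from the diagram can be written out by hand as nested Python data and walked recursively. The representation used here (a branch is a list of parts; a part is either a literal `str` or a `tuple` of branches standing for an alternation) and the name `expand_branch` are assumptions made for this sketch, not something the specification prescribes.

```python
def expand_branch(branch):
    """Return every string a branch can produce, in specification order."""
    results = [""]
    for part in branch:
        if isinstance(part, str):
            # A literal is appended to every string built so far.
            options = [part]
        else:
            # An alternation contributes the outputs of each child branch in turn.
            options = [s for child in part for s in expand_branch(child)]
        results = [prefix + option for prefix in results for option in options]
    return results


# Hand-built tree for the input  It{{em,alic}iz,erat}e{d,}
tree = [
    "It",
    (
        [(["em"], ["alic"]), "iz"],
        ["erat"],
    ),
    "e",
    (["d"], [""]),
]

for line in expand_branch(tree):
    print(line)
```

Walking the tree this way prints the six strings from the right part of the diagram, in the same order.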

Expansion of alternations can be more rigorously described by these rules:

`a` ( `2` | `1` ) `b` ( `X` | `Y` | `X` ) `c` ⟶ `a2bXc` `a2bYc` `a2bXc` `a1bXc` `a1bYc` `a1bXc`

When an alternation is encountered, the list of alternatives that will be produced by its parent branch is increased 𝑛-fold, each copy featuring one of the 𝑛 alternatives produced by the alternation's child branches, in turn, at that position.
This means that multiple alternations inside the same branch are cumulative (i.e. the complete list of alternatives produced by a branch is the string-concatenating "Cartesian product" of its parts).
All alternatives (even duplicate and empty ones) are preserved, and they are ordered like the examples demonstrate (i.e. "lexicographically" with regard to the alternations).
The alternatives produced by the root branch constitute the final output.
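These rules can be checked against the example above with a few lines of Python. The sketch assumes the branch has already been parsed into a flat sequence of parts, each given as a list of alternatives (a literal is a one-element list); the variable names are illustrative only.

```python
from itertools import product

# Parts of the branch  a{2,1}b{X,Y,X}c ; literals are one-element lists.
parts = [["a"], ["2", "1"], ["b"], ["X", "Y", "X"], ["c"]]

# product() varies the rightmost part fastest, so the leftmost alternation
# changes slowest; duplicates such as the second "a2bXc" are kept.
outputs = ["".join(choice) for choice in product(*parts)]
print(outputs)
# ['a2bXc', 'a2bYc', 'a2bXc', 'a1bXc', 'a1bYc', 'a1bXc']
```

The list has 2 × 3 = 6 entries, which is the n-fold growth described in the first rule applied once per alternation.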

Parsing the input string involves some additional complexity to deal with escaped characters and "incomplete" brace pairs:

`a\\{\{b,c\,d}` ⟶ `a\\` ( `\{b` | `c\,d` )

`{a,b{c{,$}d}e` ⟶ `{a,b{c` ( empty | `$` ) `d}e`

An unescaped backslash which precedes another character escapes that character (forcing it to be treated as a literal). The backslashes are passed along to the output unchanged.
Balanced brace pairs are identified by, conceptually, going through the string from left to right and associating each unescaped closing brace that is encountered with the nearest still unassociated unescaped opening brace to its left (if any). Furthermore, each unescaped comma is associated with the innermost brace pair that contains it (if any). With that in mind:
- Each brace pair that has at least one comma associated with it forms an alternation (whose branches are the brace pair's contents split at its commas). The associated brace and comma characters themselves do not become part of the output.
- Brace characters from pairs without any associated comma, as well as unassociated brace and comma characters, as well as all characters that are not covered by the preceding rules, are instead treated as literals.
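The association rules can be sketched as a single left-to-right scan with a stack of still-open braces. The function name `structural_positions` and its return value (the set of indices whose brace or comma belongs to an alternation) are inventions of this example; it only classifies characters and does not perform the expansion.

```python
def structural_positions(s):
    """Return the indices of braces and commas that form alternations."""
    structural = set()
    stack = []  # one [open_index, comma_indices] entry per unclosed brace
    i = 0
    while i < len(s):
        c = s[i]
        if c == "\\" and i + 1 < len(s):
            i += 2  # the backslash and the character it escapes are literal
            continue
        if c == "{":
            stack.append([i, []])
        elif c == "," and stack:
            # Tentatively attach the comma to the innermost open brace.
            stack[-1][1].append(i)
        elif c == "}" and stack:
            open_index, commas = stack.pop()
            if commas:  # pairs without a comma stay literal
                structural.update([open_index, i], commas)
        i += 1
    # Braces still on the stack were never closed; they and their commas
    # remain literal.
    return structural


print(sorted(structural_positions(r"a\\{\{b,c\,d}")))  # [3, 7, 12]
print(sorted(structural_positions("{a,b{c{,$}d}e")))   # [6, 7, 9]
```

For the two example strings above, only the brace pair around `\{b,c\,d` and the pair around `,$` are reported, which agrees with the parses shown.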

For every possible input string, your implementation should produce exactly the output which this specification mandates. Please comply with this even when it's inconvenient, to ensure that all implementations are comparable. However, none of the above should be interpreted as instructions (or even recommendations) for how to implement it. Try to come up with a solution that is idiomatic in your programming language. (See #Perl for a reference implementation.)

Test Cases

Input	Output
`~/{Downloads,Pictures}/*.{jpg,gif,png}`	`~/Downloads/*.jpg` `~/Downloads/*.gif` `~/Downloads/*.png` `~/Pictures/*.jpg` `~/Pictures/*.gif` `~/Pictures/*.png`
`It{{em,alic}iz,erat}e{d,}, please.`	`Itemized, please.` `Itemize, please.` `Italicized, please.` `Italicize, please.` `Iterated, please.` `Iterate, please.`
`{,{,gotta have{ ,\, again\, }}more }cowbell!`	`cowbell!` `more cowbell!` `gotta have more cowbell!` `gotta have\, again\, more cowbell!`
`{}} some }{,{\\{ edge, edge} \,}{ cases, {here} \\\\\}`	`{}} some }{,{\\ edge \,}{ cases, {here} \\\\\}` `{}} some }{,{\\ edge \,}{ cases, {here} \\\\\}`

Haskell

"Here is a direct translation to Haskell using parsec" (of an earlier version of the Perl 6 solution):

<lang haskell>import Control.Applicative (pure, (<$>), (<*>))
import Control.Monad (forever)
import Text.Parsec

parser :: Parsec String u [String]
parser = expand <$> many (try alts <|> try alt1 <|> escape <|> pure . pure <$> anyChar)
   where alts = concat <$> between (char '{') (char '}') (alt `sepBy2` char ',')
         alt1 = (\s -> ["{" ++ s ++ "}"]) <$> between (char '{') (char '}') (many $ noneOf ",{}")
         alt = expand <$> many (try alts <|> try alt1 <|> escape <|> pure . pure <$> noneOf ",}")
         escape = pure <$> sequence [char '\\', anyChar]
         expand = foldr (\x xs -> (++) <$> x <*> xs) [""]
         p `sepBy2` sep = (:) <$> p <*> many1 (sep >> p)

main :: IO ()
main = forever $ parse parser [] <$> getLine >>= either print (mapM_ putStrLn)</lang>

Output:

$ ./bracex
~/{Downloads,Pictures}/*.{jpg,gif,png}
~/Downloads/*.jpg
~/Downloads/*.gif
~/Downloads/*.png
~/Pictures/*.jpg
~/Pictures/*.gif
~/Pictures/*.png
It{{em,alic}iz,erat}e{d,}, please.
Itemized, please.
Itemize, please.
Italicized, please.
Italicize, please.
Iterated, please.
Iterate, please.
{,{,gotta have{ ,\, again\, }}more }cowbell!
cowbell!
more cowbell!
gotta have more cowbell!
gotta have\, again\, more cowbell!
{}} some {\\{edge,edgy} }{ cases, here\\\}
{}} some {\\edge }{ cases, here\\\}
{}} some {\\edgy }{ cases, here\\\}
a{b{1,2}c
a{b1c
a{b2c
a{1,2}b}c
a1b}c
a2b}c
a{1,{2},3}b
a1b
a{2}b
a3b
a{b{1,2}c{}}
a{b1c{}}
a{b2c{}}
^D
bracex: <stdin>: hGetLine: end of file

J

Implementation:

<lang J>NB. legit { , and } do not follow a legit backslash:
legit=: 1,_1}.4>(3;(_2[\"1".;._2]0 :0);('\';a.);0 _1 0 1)&;:&.(' '&,)
2 1   1 1 NB. result 0 or 1: initial state
2 2   1 2 NB. result 2 or 3: after receiving a non backslash
1 2   1 2 NB. result 4 or 5: after receiving a backslash
)

expand=:3 :0
 Ch=. u:inv y
 M=. N=. 1+>./ Ch
 Ch=. Ch*-_1^legit y
 delim=. 'left comma right'=. u:inv '{,}'
 J=. i.K=. #Ch
 while. M=. M+1 do.
   candidates=. i.0 2
   for_check.I. comma=Ch do.
     begin=. >./I. left=check{. Ch
     end=. check+<./I. right=check}. Ch
     if. K>:end-begin do.
       candidates=. candidates,begin,end
     end.
   end.
   if. 0=#candidates do. break. end.
   'begin end'=. candidates{~(i.>./) -/"1 candidates
   ndx=. I.(begin<:J)*(end>:J)*Ch e. delim
   Ch=. M ndx} Ch
 end.
 T=. ,<Ch
 for_mark. |.N}.i.M  do.
   T=. ; mark divide each T
 end.
 u: each |each T
)

divide=:4 :0
 if. -.x e. y do. ,<y return. end.
 mask=. x=y
 prefix=. < y #~ -.+./\ mask
 suffix=. < y #~ -.+./\. mask
 options=. }:mask <;._1 y
 prefix,each options,each suffix
)</lang>

Examples:

<lang J>   >expand t1
~/Downloads/*.jpg
~/Downloads/*.gif
~/Downloads/*.png
~/Pictures/*.jpg
~/Pictures/*.gif
~/Pictures/*.png
   >expand t2
Itemized, please.
Itemize, please.
Italicized, please.
Italicize, please.
Iterated, please.
Iterate, please.
   >expand t3
cowbell!
more cowbell!
gotta have more cowbell!
gotta have\, again\, more cowbell!
   >expand t4
{}} some {\\edge }{ cases, here\\\}
{}} some {\\edgy }{ cases, here\\\}</lang>

Explanation:

Instead of working directly with text, work with a string of numeric unicode values. Negate the numbers for characters which are "off limits" because of preceding backslashes (we will take the absolute value and convert back to unicode for the final result). Also, find a limit value larger than that of the largest in-use character.

Then, iteratively: for each relevant comma, find the location of the closest surrounding braces. From these candidates, pick the pair of braces that are the shortest distance apart. Mark those braces and the relevant commas they contain by replacing their character codes with an integer larger than any previously used (all of the characters in one such set get marked with the same number). Repeat until no more candidates can be found.

Finally, for each integer that we've used to mark delimiter locations, split out each of the marked options (each with a copy of that group's prefix and suffix). When all that is done, take the absolute values and convert back to Unicode for the final result.
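The first step of this explanation (numeric code points, with escaped characters negated so they can never be mistaken for delimiters) can be illustrated in Python. This is only a sketch of the idea as described in the prose, not a transliteration of the J code; the helper names `encode` and `decode` are made up for the example.

```python
def encode(s):
    """Code points of s, negated where a preceding backslash escapes the character."""
    codes, escaped = [], False
    for ch in s:
        if escaped:
            codes.append(-ord(ch))  # "off limits": can never match a delimiter
            escaped = False
        else:
            codes.append(ord(ch))
            escaped = (ch == "\\")
    return codes


def decode(codes):
    """Undo encode() by taking absolute values."""
    return "".join(chr(abs(n)) for n in codes)


text = r"a\{b,c\,d}"
codes = encode(text)

# A value larger than any in-use character code, usable as the first marker.
limit = max(abs(n) for n in codes) + 1

# Only the unescaped comma still compares equal to ord(",").
live_commas = [i for i, n in enumerate(codes) if n == ord(",")]
print(live_commas)            # [4]
print(decode(codes) == text)  # True
```

Because escaped characters carry negative codes, a plain equality test against the code of `{`, `,` or `}` skips them without any extra bookkeeping, and the original text is recovered at the end by dropping the sign.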

Perl

Perl has a built-in glob function which does brace expansion, but it can't be used to solve this task because it also does shell-like word splitting, wildcard expansion, and tilde expansion at the same time. The File::Glob core module gives access to the lower-level bsd_glob function, which does support brace expansion on its own; however, it does not comply with this task's specification when it comes to preserving backslashes and handling unbalanced brace characters.

So here is a manual solution that implements the specification precisely:

<lang perl>sub brace_expand {

   my $input = shift;
   my @stack = ([my $current = []]);
   
   while ($input =~ /\G ((?:[^\\{,}]++ | \\(?:.|\z))++ | . )/gx) {
       if ($1 eq '{') {
           push @stack, [$current = []];
       }
       elsif ($1 eq ',' && @stack > 1) {
           push @{$stack[-1]}, ($current = []);
       }
       elsif ($1 eq '}' && @stack > 1) {
           my $group = pop @stack;
           $current = $stack[-1][-1];
           
           # handle the case of brace pairs without commas:
           @{$group->[0]} = map { "{$_}" } @{$group->[0]} if @$group == 1;
           
           @$current = map {
               my $c = $_;
               map { map { $c . $_ } @$_ } @$group;
           } @$current;
       }
       else { $_ .= $1 for @$current; }
   }
   
   # handle the case of missing closing braces:
   while (@stack > 1) {
       my $right = pop @{$stack[-1]};
       my $sep;
       if (@{$stack[-1]}) { $sep = ',' }
       else               { $sep = '{'; pop @stack }
       $current = $stack[-1][-1];
       @$current = map {
           my $c = $_;
           map { $c . $sep . $_ } @$right;
       } @$current;
   }
   
   return @$current;

}</lang>

Usage demonstration:
<lang perl>while (my $input = <DATA>) {
   chomp($input);
   print "$input\n";
   print "    $_\n" for brace_expand($input);
   print "\n";
}

__DATA__
~/{Downloads,Pictures}/*.{jpg,gif,png}
It{{em,alic}iz,erat}e{d,}, please.
{,{,gotta have{ ,\, again\, }}more }cowbell!
{}} some }{,{\\{ edge, edge} \,}{ cases, {here} \\\\\}</lang>

Output:

~/{Downloads,Pictures}/*.{jpg,gif,png}
    ~/Downloads/*.jpg
    ~/Downloads/*.gif
    ~/Downloads/*.png
    ~/Pictures/*.jpg
    ~/Pictures/*.gif
    ~/Pictures/*.png

It{{em,alic}iz,erat}e{d,}, please.
    Itemized, please.
    Itemize, please.
    Italicized, please.
    Italicize, please.
    Iterated, please.
    Iterate, please.

{,{,gotta have{ ,\, again\, }}more }cowbell!
    cowbell!
    more cowbell!
    gotta have more cowbell!
    gotta have\, again\, more cowbell!

{}} some }{,{\\{ edge, edge} \,}{ cases, {here} \\\\\}
    {}} some }{,{\\ edge \,}{ cases, {here} \\\\\}
    {}} some }{,{\\ edge \,}{ cases, {here} \\\\\}

Perl 6

The task description allows the taking of shortcuts, but please note that we are not taking any shortcuts here. The solution is short because this particular problem happens to map quite naturally to the strengths of Perl 6.

First, the parsing is handled with a grammar that can backtrack in the few places this problem needs it. The 2..* is the exact quantifier needed for valid items, and the % operator requires the comma between each quantified item to its left. Note that the * quantifiers do not backtrack here, because the token keyword suppresses that; all the backtracking here fails over to a different alternative in an outer alternation (that is, things separated by the | character in the grammar, which means match order is determined by longest token matching.)

On the other end, we recursively walk the parse tree returning expanded sublists, and we do the cartesian concatenation of sublists at each level by use of the X~ operator, which is a "cross" metaoperator used on a simple ~ concatenation. As a list infix operator, X~ does not care how many items are on either side, which is just what you want in this case, since some of the arguments are strings and some are lists. Here we use a fold or reduction form in square brackets to interpose the cross-concat between each value generated by the map, which returns a mixture of lists and literal strings.

One other thing that might not be obvious: if we bind to the match variable, $/, we automatically get all the syntactic sugar for its submatches. In this case, $0 is short for $/[0], and represents all the submatches captured by the 0th set of parens in either TOP or alt. $<meta> is likewise short for $/<meta>, and retrieves what was captured by that named submatch.
<lang perl6>grammar BraceExpansion {

   token TOP { ( <meta> | . )* }

   token meta {
       | '{' <alts> '}'
       | <fakegroup>
       | \\ .
   }

   token alts { <alt> ** 2..* % ',' }

   token alt { ( <meta> | <-[ , } ]> )* }

   token fakegroup { '{' [ \\. | <-[ { , } ]> | <fakegroup> ]* '}' }

}

sub crosswalk($/) {

   [X~] '', $0.map: -> $/ { [$<meta><alts><alt>».&crosswalk] or ~$/ }

}

sub brace-expand($s) { crosswalk BraceExpansion.parse($s) }</lang>
And to test...
<lang perl6>sub bxtest(*@s) {

   for @s -> $s {
       say "\n$s";
       for brace-expand($s) {
           say "    ", $_;
       }
   }

}

bxtest Q:to/END/.lines;

   ~/{Downloads,Pictures}/*.{jpg,gif,png}
   It{{em,alic}iz,erat}e{d,}, please.
   {,{,gotta have{ ,\, again\, }}more }cowbell!
   {}} some {\\{edge,edgy} }{ cases, here\\\}
   a{b{1,2}c
   a{1,2}b}c
   a{1,{2},3}b
   a{b{1,2}c{}}
   END</lang>

Output:

~/{Downloads,Pictures}/*.{jpg,gif,png}
    ~/Downloads/*.jpg
    ~/Downloads/*.gif
    ~/Downloads/*.png
    ~/Pictures/*.jpg
    ~/Pictures/*.gif
    ~/Pictures/*.png

It{{em,alic}iz,erat}e{d,}, please.
    Itemized, please.
    Itemize, please.
    Italicized, please.
    Italicize, please.
    Iterated, please.
    Iterate, please.

{,{,gotta have{ ,\, again\, }}more }cowbell!
    cowbell!
    more cowbell!
    gotta have more cowbell!
    gotta have\, again\, more cowbell!

{}} some {\\{edge,edgy} }{ cases, here\\\}
    {}} some {\\edge }{ cases, here\\\}
    {}} some {\\edgy }{ cases, here\\\}

a{b{1,2}c
    a{b1c
    a{b2c

a{1,2}b}c
    a1b}c
    a2b}c

a{1,{2},3}b
    a1b
    a{2}b
    a3b

a{b{1,2}c{}}
    a{b1c{}}
    a{b2c{}}

Python

<lang python>def getitem(s, depth=0):

   out = [""]
   while s:
       c = s[0]
       if depth and (c == ',' or c == '}'):
           return out,s
       if c == '{':
           x = getgroup(s[1:], depth+1)
           if x:
               out,s = [a+b for a in out for b in x[0]], x[1]
               continue
       if c == '\\' and len(s) > 1:
           s, c = s[1:], c + s[1]

       out, s = [a+c for a in out], s[1:]

   return out,s

def getgroup(s, depth):

   out, comma = [], False
   while s:
       g,s = getitem(s, depth)
       if not s: break
       out += g

       if s[0] == '}':
           if comma: return out, s[1:]
           return ['{' + a + '}' for a in out], s[1:]

       if s[0] == ',':
           comma,s = True, s[1:]

   return None

# stolen cowbells from perl6 example
for s in '''~/{Downloads,Pictures}/*.{jpg,gif,png}
It{{em,alic}iz,erat}e{d,}, please.
{,{,gotta have{ ,\, again\, }}more }cowbell!
{}} some }{,{\\\\{ edge, edge} \,}{ cases, {here} \\\\\\\\\}'''.split('\n'):
   print("\n\t".join([s] + getitem(s)[0]) + "\n")</lang>

Output:

~/{Downloads,Pictures}/*.{jpg,gif,png}
        ~/Downloads/*.jpg
        ~/Downloads/*.gif
        ~/Downloads/*.png
        ~/Pictures/*.jpg
        ~/Pictures/*.gif
        ~/Pictures/*.png

It{{em,alic}iz,erat}e{d,}, please.
        Itemized, please.
        Itemize, please.
        Italicized, please.
        Italicize, please.
        Iterated, please.
        Iterate, please.

{,{,gotta have{ ,\, again\, }}more }cowbell!
        cowbell!
        more cowbell!
        gotta have more cowbell!
        gotta have\, again\, more cowbell!

{}} some }{,{\\{ edge, edge} \,}{ cases, {here} \\\\\}
        {}} some }{,{\\ edge \,}{ cases, {here} \\\\\}
        {}} some }{,{\\ edge \,}{ cases, {here} \\\\\}