Suffix tree: Difference between revisions

Content added Content deleted

Inline

Revision as of 17:52, 26 August 2013

A suffix tree is a data structure commonly used in string algorithms.

Given a string S of length n, its suffix tree is a tree T such that:

T has exactly n leaves numbered from 1 to n.
Except for the root, every internal node has at least two children.
Each edge of T is labelled with a non-empty substring of S.
No two edges starting out of a node can have string labels beginning with the same character.
The string obtained by concatenating all the string labels found on the path from the root to leaf i spells out suffix S[i..n], for i from 1 to n.

Such a tree does not exist for all strings. To ensure existence, a character that is not found in S must be appended at its end. The character '$' is traditionally used for this purpose.

For this task, build and display the suffix tree of the string "banana$". Displaying the tree can be done using the code from the visualize a tree task, but any other convenient method is accepted.

There are several ways to implement the tree data structure, for instance how edges should be labelled. Latitude is given in this matter, but notice that a simple way to do it is to label each node with the label of the edge leading to it.

The computation time for an efficient algorithm should be $O(n)$ , but such an algorithm might be difficult to implement. An easier, $O(n^{2})$ algorithm is accepted.

Perl

Translation of: Perl 6

<lang Perl>use strict; use warnings; use Data::Dumper;

sub classify {

   my ($f, $h) = (shift, {});
   for (@_) { push @{$h->{$f->($_)}}, $_ }
   return $h;

} sub suffixes {

   my $str = shift;
   map { substr $str, $_ } 0 .. length($str) - 1;

} sub suffix_tree {

   return +{} if @_ == 0;
   return +{ $_[0] => +{} } if @_ == 1;
   my $h = {};
   my $classif = classify sub { substr shift, 0, 1 }, @_;
   for my $key (sort keys %$classif) {
       my $subtree = suffix_tree(
           grep "$_", map { substr $_, 1 } @{$classif->{$key}}
       );
       my @subkeys = keys %$subtree;
       if (@subkeys == 1) {
           my $subkey = shift @subkeys;
           $h->{"$key$subkey"} = $subtree->{$subkey};
       } else { $h->{$key} = $subtree }
   }
   return $h;

}

print +Dumper suffix_tree suffixes 'banana$';</lang>

Output:

$VAR1 = {
          '$' => {},
          'a' => {
                   '$' => {},
                   'na' => {
                             'na$' => {},
                             '$' => {}
                           }
                 },
          'banana$' => {},
          'na' => {
                    'na$' => {},
                    '$' => {}
                  }
        };

Perl 6

Here is quite a naive algorithm, probably $O(n^{2})$ .

<lang Perl 6>multi suffix-tree(Str $str) { suffix-tree map &flip, [\~] $str.flip.comb } multi suffix-tree(@a) {

   hash
   @a == 0 ?? () !!
   @a == 1 ?? @a[0] => [] !!
   gather for @a.classify(*.substr(0, 1)) {
       my $subtree = suffix-tree(grep *.chars, map *.substr(1), .value[]);
       if $subtree == 1 {
           my $pair = $subtree.pick;
           take .key ~ $pair.key => $pair.value;
       } else {
           take .key => $subtree;
       }
   }

}</lang>

Displaying the tree is done with the code from visualize a tree: <lang Perl 6>my $tree = root => suffix-tree 'banana$'; .say for visualize-tree $tree, *.key, *.value.list;</lang>

Output:

root
├─$
├─a
│ ├─$
│ └─na
│   ├─$
│   └─na$
├─na
│ ├─$
│ └─na$
└─banana$

Racket

See Suffix trees with Ukkonen’s algorithm by Danny Yoo for more information on how to use suffix trees in Racket.

lang racket

(require (planet dyoo/suffixtree)) (define tree (make-tree)) (tree-add! tree (string->label "rosettacode$")) (for ([i (in-naturals)]

     [c (node-children (tree-root tree))])
 (printf "~a: ~a\n" i (label->string (node-up-label c))))

</lang> Output: <lang racket> 0: $ 1: e 2: de$ 3: o 4: code$ 5: acode$ 6: t 7: settacode$ 8: rosettacode$ </lang>

@@ Line 1: / Line 1: @@
 {{draft task}}
+[[Category:String algorithms]]
 A [[wp:suffix tree|suffix tree]] is a data structure commonly used in [[wp:string algorithms|string algorithms]].