Longest common subsequence: Difference between revisions

Content added Content deleted
m (Reformatted References)
(Undo revision 344177 by CNHume ([[Userp formatting)
Line 1: Line 1:
{{task}}[[Category:Recursion]][[Category:Memoization]]
{{task}}[[Category:Recursion]][[Category:Memoization]]
=== Introduction ===

Define a subsequence to be any string obtained by deleting zero or more symbols from an input string.
Define a subsequence to be any string obtained by deleting zero or more symbols from an input string.


The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''Longest Common Subsequence'''] (or '''LCS''') is a subsequence of maximum length common to two (or more) strings.
The '''Longest Common Subsequence''' (or [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''LCS''']) is a subsequence of maximum length common to two (or more) strings.


Let A = A[0]… A[m-1] and B = B[0]… B[n-1], m <= n be strings drawn from an alphabet '''Σ''' of size s, containing every distinct symbol in A + B.
Let A = A[0]… A[m-1] and B = B[0]… B[n-1], m <= n be strings drawn from an alphabet '''Σ''' of size s, containing every distinct symbol in A + B.
Line 12: Line 10:
Define the strict Cartesian product-order (<) over matches, such that (i1, j1) < (i2, j2) iff i1 < j1 and i2 < j2. Defining (>) similarly, we can write m2 < m1 as m1 > m2.
Define the strict Cartesian product-order (<) over matches, such that (i1, j1) < (i2, j2) iff i1 < j1 and i2 < j2. Defining (>) similarly, we can write m2 < m1 as m1 > m2.


If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are ''incomparable''. Defining (#) to denote this case, we can write m1 # m2.
If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are said to be ''incomparable''. Defining (#) to denote this case, we can write m1 # m2.


Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where either m1 < m2 or m2 < m1 for every pair of distinct elements m1 and m2 of '''C'''.
Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where either m1 < m2 or m2 < m1 for every pair of distinct elements m1 and m2 of '''C'''.
Line 22: Line 20:
Then any chain '''C''' can be visualized as a monotonically increasing curve through those match bits which are set to True.
Then any chain '''C''' can be visualized as a monotonically increasing curve through those match bits which are set to True.


=== Background ===
'''Background'''


Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(m*n) growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.
Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(m*n) growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.
Line 36: Line 34:
Claus Rick has also described a linear-space algorithm with a time bound of O(n*s + min(p*m, p*(n-p))).
Claus Rick has also described a linear-space algorithm with a time bound of O(n*s + min(p*m, p*(n-p))).


=== References ===
'''References'''

"A linear space algorithm for computing maximal common subsequences" by Daniel S. Hirschberg, published June 1975<br />
Communications of the ACM [Volume 18, Number 6, pp. 341–343]

"An Algorithm for Differential File Comparison" by James W. Hunt and M. Douglas McIlroy, June 1976<br />
Computing Science Technical Report, Bell Laboratories 41

"A Fast Algorithm for Computing Longest Common Subsequences" by James W. Hunt and Thomas G. Szymanski, published May 1977<br />
Communications of the ACM [Volume 20, Number 5, pp. 350-353]


# "A linear space algorithm for computing maximal common subsequences" by Daniel S. Hirschberg, published June 1975 Communications of the ACM [Volume 18, Number 6, pp. 341–343]
"Simple and fast linear space computation of longest common subsequences" by Claus Rick, received 17 March 2000, Information Processing Letters,<br />
Elsevier Science [Volume 75, pp. 275–281]
# "An Algorithm for Differential File Comparison" by James W. Hunt and M. Douglas McIlroy, June 1976 Computing Science Technical Report, Bell Laboratories 41
# "A Fast Algorithm for Computing Longest Common Subsequences" by James W. Hunt and Thomas G. Szymanski, published May 1977 Communications of the ACM [Volume 20, Number 5, pp. 350-353]
# "Simple and fast linear space computation of longest common subsequences" by Claus Rick, received 17 March 2000, Information Processing Letters, Elsevier Science [Volume 75, pp. 275–281]


=== Examples ===
'''Examples'''


The sequences "1234" and "1224533324" have an LCS of "1234":
The sequences "1234" and "1224533324" have an LCS of "1234":