Longest common subsequence: Difference between revisions

Line 1:

{{task}}[[Category:Recursion]][[Category:Memoization]]

=== Introduction ===

Define a subsequence to be any string obtained by deleting zero or more symbols from an input string.

The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem ~~'''Longest Common Subsequence'''] (or~~ '''LCS''') is a subsequence of maximum length common to two (or more) strings.

The '''Longest Common Subsequence''' (or [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''LCS''']) is a subsequence of maximum length common to two (or more) strings.

Let A = A[0]… A[m-1] and B = B[0]… B[n-1], m <= n be strings drawn from an alphabet '''Σ''' of size s, containing every distinct symbol in A + B.

Line 12:

Line 10:

Define the strict Cartesian product-order (<) over matches, such that (i1, j1) < (i2, j2) iff i1 < j1 and i2 < j2. Defining (>) similarly, we can write m2 < m1 as m1 > m2.

If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are ''incomparable''. Defining (#) to denote this case, we can write m1 # m2.

If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are said to be ''incomparable''. Defining (#) to denote this case, we can write m1 # m2.

Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where either m1 < m2 or m2 < m1 for every pair of distinct elements m1 and m2 of '''C'''.

Line 22:

Line 20:

Then any chain '''C''' can be visualized as a monotonically increasing curve through those match bits which are set to True.

~~===~~ Background ~~===~~

'''Background'''

Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(m*n) growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.

Line 36:

Line 34:

Claus Rick has also described a linear-space algorithm with a time bound of O(n*s + min(p*m, p*(n-p))).

~~===~~ References ~~===~~

'''References'''

"A linear space algorithm for computing maximal common subsequences" by Daniel S. Hirschberg, published June 1975<br />

Communications of the ACM [Volume 18, Number 6, pp. 341–343]

⚫

"An Algorithm for Differential File Comparison" by James W. Hunt and M. Douglas McIlroy, June 1976<br />

Computing Science Technical Report, Bell Laboratories 41

⚫

"A Fast Algorithm for Computing Longest Common Subsequences" by James W. Hunt and Thomas G. Szymanski, published May 1977<br />

Communications of the ACM [Volume 20, Number 5, pp. 350-353]

# "A linear space ~~algorithm~~ ~~for~~ ~~computing maximal~~ common subsequences" by ~~Daniel~~ ~~S. Hirschberg~~, ~~published~~ ~~June~~ ~~1975~~ ~~Communications~~ of ~~the~~ ~~ACM [Volume 18~~, ~~Number 6, pp. 341–343]~~

"Simple and fast linear space computation of longest common subsequences" by Claus Rick, received 17 March 2000, Information Processing Letters,<br />

Elsevier Science [Volume 75, pp. 275–281]

⚫

# "An Algorithm for Differential File Comparison" by James W. Hunt and M. Douglas McIlroy, June 1976 ~~Computing Science Technical Report, Bell Laboratories 41~~

⚫

# "A Fast Algorithm for Computing Longest Common Subsequences" by James W. Hunt and Thomas G. Szymanski, published May 1977 ~~Communications of the ACM [Volume 20, Number 5, pp. 350-353]~~

# "Simple and fast linear space computation of longest common subsequences" by Claus Rick, received 17 March 2000, Information Processing Letters, Elsevier Science [Volume 75, pp. 275–281]

~~===~~ Examples ~~===~~

'''Examples'''

The sequences "1234" and "1224533324" have an LCS of "1234":