Bioinformatics/Global alignment

Global alignment is designed to search for highly similar regions in two or more DNA sequences, where the sequences appear in the same order and orientation, fitting the sequences in as pieces in a puzzle.

Current DNA sequencers find the sequence for multiple small segments of DNA which have mostly randomly formed by splitting a much larger DNA molecule into shorter segments. When re-assembling such segments of DNA sequences into a larger sequence to form, for example, the DNA coding for the relevant gene, the overlaps between multiple shorter sequences are commonly used to decide how the longer sequence is to be assembled. For example, "AAGATGGA", GGAGCGCATC", and "ATCGCAATAAGGA" can be assembled into the sequence "AAGATGGAGCGCATCGCAATAAGGA" by noting that "GGA" is at the tail of the first string and head of the second string and "ATC" likewise is at the tail of the second and head of the third string.

When looking for the best global alignment in the output strings produced by DNA sequences, there are typically a large number of such overlaps among a large number of sequences. In such a case, the ordering that results in the shortest common superstring is generrally preferred.

Finding such a supersequence is an NP-hard problem, and many algorithms have been proposed to shorten calculations, especially when many very long sequences are matched.

The shortest common superstring as used in bioinfomatics here differs from the string task Shortest_common_supersequence. In that task, a supersequence may have other characters interposed as long as the characters of each subsequence appear in order, so that (abcbdab, abdcaba) -> abdcabdab. In this task, (abcbdab, abdcaba) -> abcbdabdcaba.

Task

Given N non-identical strings of characters A, C, G, and T representing N DNA sequences, find the shortest DNA sequence containing all N sequences.

Handle cases where two sequences are identical or one sequence is entirely contained in another.

Print the resulting sequence along with its size (its base count) and a count of each base in the sequence.

Find the shortest common superstring for the following four examples:

("TA", "AAG", "TA", "GAA", "TA")

("CATTAGGG", "ATTAG", "GGG", "TA")

("AAGAUGGA", "GGAGCGCAUC", "AUCGCAAUAAGGA")

("ATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTAT",
"GGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGT",
"CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
"TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
"AACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT",
"GCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTC",
"CGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCT",
"TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGC",
"GATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATT",
"TTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC",
"CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
"TCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGA")

Related tasks

Julia

<lang julia>using Combinatorics

""" Given a DNA sequence, report the sequence, length and base counts""" function printcounts(seq)

   bases = [['A', 0], ['C', 0], ['G', 0], ['T', 0]]
   for c in seq, base in bases
       if c == base[1]
           base[2] += 1
       end
   end
   println("\nNucleotide counts for $seq:\n")
   for base in bases
       println(lpad(base[1], 10), lpad(string(base[2]), 12))
   end
   println(lpad("Other", 10), lpad(string(length(seq) - sum(x[2] for x in bases)), 12))
   println("     _________________\n", lpad("Total length", 14), lpad(string(length(seq)), 8))

end

"""Return the number of chars of overlap of tail of string s1 with head of string s2""" function headtailoverlap(s1, s2, minimumoverlap=1)

   start = 1
   while true
       range = findnext(s2[1:minimumoverlap], s1, start)
       range == nothing && return 0
       start = range.start
       startswith(s2, s1[start:end]) && return length(s1) - start + 1
       start += 1
   end

end

"""Remove duplicates and strings contained within a larger string from vector of strings""" function deduplicate(svect)

   filtered = empty(svect)
   arr = unique(svect)
   for (i, s1) in enumerate(arr)
       any(p -> p[1] != i && occursin(s1, p[2]), enumerate(arr)) && continue
       push!(filtered, s1)
   end
   return filtered

end

"""Returns shortest common superstring of a vector of strings""" function shortest_common_superstring(svect)

   ss = deduplicate(svect)
   shortestsuper = prod(ss)
   for perm in permutations(ss)
       sup = first(perm)
       for i in 1:length(ss)-1
           overlaplen = headtailoverlap(perm[i], perm[i+1], 1)
           sup *= perm[i + 1][overlaplen+1:end]
       end
       if length(sup) < length(shortestsuper)
           shortestsuper = sup
       end
   end
   return shortestsuper

end

testsequences = [ ["TA", "AAG", "TA", "GAA", "TA"], ["CATTAGGG", "ATTAG", "GGG", "TA"], ["AAGAUGGA", "GGAGCGCAUC", "AUCGCAAUAAGGA"], ["ATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTAT", "GGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGT", "CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA", "TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC", "AACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT", "GCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTC", "CGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCT", "TGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC", "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGC", "GATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATT", "TTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATC", "CTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA", "TCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGA"] ]

for test in testsequences

   scs = shortest_common_superstring(test)
   printcounts(scs)

end

</lang>

Output:

Nucleotide counts for TAAGAA:

         A           4
         C           0
         G           1
         T           1
     Other           0
     _________________
  Total length       6

Nucleotide counts for CATTAGGG:

         A           2
         C           1
         G           3
         T           2
     Other           0
     _________________
  Total length       8

Nucleotide counts for AAGAUGGAGCGCAUCGCAAUAAGGA:

         A          10
         C           4
         G           8
         T           0
     Other           3
     _________________
  Total length      25

Nucleotide counts for CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA:

         A          74
         C          57
         G          75
         T          94
     Other           0
     _________________
  Total length     300