Bioinformatics/Subsequence

Revision as of 23:02, 23 March 2021 by Petelomax (talk | contribs) (→‎{{header|Ring}}: video link is still private)

Randomly generate a string of   200   DNA bases   (represented by   A,  C,  G,  and  T).

Bioinformatics/Subsequence is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
Task

Write a routine to find all the positions of a randomly generated subsequence   (four letters).

Factor

Works with: Factor version 0.99 2021-02-05

<lang factor>USING: accessors formatting grouping io kernel math math.functions.integer-logs math.parser random regexp sequences ;

new-dna ( n -- str ) [ "ACGT" random ] "" replicate-as ;
pad ( n d -- str ) [ number>string ] dip 32 pad-head ;
.dna ( seq n -- )
   seq length integer-log10 1 + :> d seq n group
   [ n * d pad write ": " write write nl ] each-index ;
.match ( slice -- ) [ from>> ] [ to>> ] bi "%d..%d\n" printf ;
.matches ( slices -- )
   "Matches found at the following indices:" print
   [ .match ] each ;
.locate ( slices -- )
   [ "No matches found." print ] [ .matches ] if-empty ;
.biosub ( dna-size row-size -- )
   [ new-dna dup ] [ .dna nl ] bi*
   4 new-dna dup "Subsequence to locate: %s\n" printf
   <regexp> all-matching-slices .locate ;

80 10 .biosub nl 600 39 .biosub nl</lang>

Output:
 0: ATTCAAGGAC
10: CACTATTAAC
20: CTGCATTGTG
30: AGAACTTGCA
40: GTGTACCGAG
50: AGCGAGTTTA
60: AAGCAACACA
70: TCTTTACCGA

Subsequence to locate: GTAG
No matches found.

  0: GATCTCGTCATGGTCCATCCTAACATTTCGGTTGTGGGC
 39: GCATCCCGATAGGCGAAGTTAAATCTACGTAGTCCTACG
 78: TCACGACGGAACATGATTGCCCACCGAAGTCGTAGGCGA
117: GCTAAAGTCGGTACATACACGATCTGCTATATTCGTTCT
156: CCGACACACGACATGCAATCCGAGAAGCTCTCGAAGTGC
195: GGTCAGATCCTCAGACTCGAACAGAGGAGACCTTAACTG
234: ATACCCACAGTACTTCTCGCATAACCTAAGCACCTATGC
273: TTACACCATCGTCCTGATATTGAGTGAGTCTGGTCGGAG
312: ATATTATCTAGCACCCTCAAGCTCTGTGTGCCACACCAG
351: GATTCCACTTCGCGCTTGCCTAGAGAAAGTAGAGTAGGT
390: GGTGTCATTAGTACACTGTTTGCGATGCACCAACCAAAC
429: CCGACCGCCATGATGACTGCTTTTCGGCCAACGTCAGAT
468: TAAGAGTACTTTTAGTAGCACCGCAAGCCAGCCGGTTTA
507: GCAAGATCCTGCAGCCTCCACGTTATTTCAGGTCTCTAA
546: GCGTTCTTTCCATGGAAGTAGTCACCGCTCCCGTTGCCA
585: ATGGACACAGACGTT

Subsequence to locate: ATAT
Matches found at the following indices:
145..149
289..293
312..316

Julia

<lang julia>DNArand(n, bases=['A', 'T', 'C', 'G']) = String(rand(bases, n))

DNAsearch(needle, haystack, lap=true) = findall(needle, haystack, overlap=lap)

const rand_string = DNArand(200) const subseq = DNArand(4)

println("Search sequence:\n$rand_string\nfor substring $subseq. Found at positions: ") foreach(p -> print(rpad(p[2], 8), p[1] % 10 == 0 ? "\n" : ""), enumerate(DNAsearch(subseq, rand_string)))

</lang>

Output:
Search sequence:
CCGAAGCCAGGAGGACTGAGCGCTTGCGTCCCGAGTTCTGCGACGAGTCTCTTCATTATAAGGCCACTGATTGCGCTCATCATGAGTGCCAGAAGCACCGCTAAACATAAGTGTCCTTTCTTCCTGACGCACTTGAAGATTGTGACCATTTGTGCGGGTTGTGAGTTAGGGGCTCTCATTGTACACGATCTATAGTGTGC
for substring CGCT. Found at positions:
21:24   74:77   99:102

Phix

Note: match_all() is due to become a builtin in the next release, so the version below may or may not need renaming/deleting before it will run.
Currently only searches for non-overlapped sequences, but it should be pretty obvious how to change that, in which case the next underline will simply partially overwrite the previous, so you'll get eg "<=<==>".

constant cheat = false

function grandna(integer len)
    string dna = repeat(' ',len)
    for i=1 to len do dna[i] = "ACGT"[rand(4)] end for
    return dna
end function

procedure show(string dna, sequence idx)
    idx &= length(dna)+100 -- (add an otherwise unused sentinel)
    sequence s = split(trim(join_by(split(join_by(dna,1,10,""),"\n"),1,5," ")),"\n")
    integer ii = 1,         -- idx index
            i = idx[ii],    -- current target
            ux = 1,         -- underline index (1..4)
            ldx = 1         -- line index (1, 51, 101, etc)
    for si=1 to length(s) do
        printf(1,"%3d: %s\n",{ldx,s[si]})
        ldx += 50
        if i and i<ldx then
            string ul = repeat(' ',59)
            while i and i<ldx do
                integer up = i-ldx+51       -- underline pos (relative to ldx)
                up += floor((up-1)/10)+5    -- (plus any needed spacing)
                ul[up] = "<==>"[ux]
                ux += 1
                i += 1
                if ux>4 then
                    ux = 1
                    ii += 1
                    i = idx[ii]
                end if
            end while
            printf(1,"%s\n",ul)
        end if
    end for
    if length(idx) then
        string s = iff(length(idx)>1?"s":""),
               t = join(apply(idx,sprint),", ")
        printf(1,"%s occurs at location%s: %s\n",{test,s,t})
    else
        printf(1,"%s does not occur\n",{test})
    end if
end procedure

function match_all(object needle, sequence haystack, bool bOverlap = false)
    if atom(needle) then return find_all(needle,haystack) end if
    integer start = 1
    sequence res = {}
    while 1 do
        start = match(needle,haystack,start)
        if start=0 then exit end if
        res = append(res,start)
        start += iff(bOverlap?1:length(needle))
    end while
    return res
end function
 
string dna = grandna(200),
       test = grandna(4)
constant cheats = iff(cheat?{9,13,49,60,64,68}:{})
for i=1 to length(cheats) do
    dna[cheats[i]..cheats[