Bioinformatics/base count: Difference between revisions

From Rosetta Code
Revision as of 21:58, 25 November 2019

Bioinformatics/base count is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Given this string representing ordered DNA bases:

CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
  1. "Pretty print" the sequence, followed by a summary of the counts of each base (A, C, G, and T) and the total count of bases in the string.

Python

<lang python>from collections import Counter

def basecount(dna):
    """Return (base, count) pairs sorted by base."""
    return sorted(Counter(dna).items())

def seq_split(dna, n=50):
    """Split the sequence into chunks of n bases."""
    return [dna[i: i+n] for i in range(0, len(dna), n)]

def seq_pp(dna, n=50):
    """Pretty print the sequence, then the per-base counts and the total."""
    for i, part in enumerate(seq_split(dna, n)):
        print(f"{i*n:>5}: {part}")
    print("\n  BASECOUNT:")
    tot = 0
    for base, count in basecount(dna):
        print(f"    {base:>3}: {count}")
        tot += count
    base, count = 'TOT', tot
    print(f"    {base:>3}= {count}")

if __name__ == '__main__':
    print("SEQUENCE:")
    # The trailing backslashes join the lines into one 500-base string.
    sequence = '''\
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG\
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG\
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT\
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT\
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG\
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA\
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT\
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG\
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC\
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT'''
    seq_pp(sequence)
</lang>

Output:
SEQUENCE:
    0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
   50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
  100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
  150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
  200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
  250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
  300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
  350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
  400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
  450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT

  BASECOUNT:
      A: 129
      C: 97
      G: 119
      T: 155
    TOT= 500
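
As a cross-check on counts like those above, the following is a minimal sketch that counts only the four expected bases and rejects anything else. The function name `count_bases` and its `alphabet` parameter are assumptions made for illustration; they are not part of the task or of the solution above.

```python
def count_bases(dna, alphabet="ACGT"):
    """Count each base in `alphabet`; raise ValueError on any other symbol.

    Hypothetical helper for illustration only. Whitespace is removed and
    the sequence is uppercased before counting.
    """
    dna = "".join(dna.split()).upper()          # drop whitespace, normalise case
    unexpected = set(dna) - set(alphabet)       # symbols outside the alphabet
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return {base: dna.count(base) for base in alphabet}


if __name__ == "__main__":
    counts = count_bases("CGTA AAAA")
    print(counts)
    # The per-base counts should add up to the number of bases (8 here).
    print(sum(counts.values()))
```

Because every symbol is checked against the alphabet, the sum of the per-base counts always equals the length of the cleaned sequence, which is the same consistency check the total line in the output provides.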

REXX

<lang rexx>/*REXX program finds the number of each base in a DNA string (along with a total). */
parse arg dna .
if dna=='' | dna==","  then dna= CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG ,
                                 CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG ,
                                 AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT ,
                                 GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT ,
                                 CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG ,
                                 TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA ,
                                 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT ,
                                 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG ,
                                 TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC ,
                                 GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
dna= space(dna, 0);  upper dna                   /*elide blanks from DNA; uppercase it. */
say '────────length of the DNA string: '   length(dna)
@.= 0                                            /*initialize the count for all bases.  */
w= 1                                             /*the maximum width of a base count.   */
$=                                               /*a placeholder for the names of bases.*/
      do j=1  for length(dna)                   /*traipse through the  DNA  string.    */
      _= substr(dna, j, 1)                      /*obtain a base name from the DNA str. */
      if pos(_, $)==0  then $=$ || _            /*if not found before, add it to list. */
      @._= @._ + 1                              /*bump the count of this base.         */
      w= max(w, length(@._) )                   /*compute the maximum width number.    */
      end   /*j*/
say
      do k=0  for 255;   z= d2c(k)              /*traipse through all possibilities.   */
      if pos(z, $)==0  then iterate             /*Was this base found?  No, then skip. */
      say '     base '   z    " has a basecount of: "   right(@.z, w)
      @.tot= @.tot + @.z                        /*add to a grand total to verify count.*/
      end   /*k*/
say                                              /*stick a fork in it, we're all done.  */
say '────────total for all basecounts:'   right(@.tot, w+1)</lang>

Output when using the default input:
────────length of the DNA string:  500

     base  A  has a basecount of:  129
     base  C  has a basecount of:   97
     base  G  has a basecount of:  119
     base  T  has a basecount of:  155

────────total for all basecounts:  500