
Talk:Bioinformatics/base count


'Pretty-prints' shown match no standard DNA sequence format

As the Perl entry comments, "The specs for what 'pretty print' means are sadly lacking" here.

To raise it anywhere above the level of "the Letter frequency task all over again, just simpler and dressed up in different clothes", it would need, at the very least, to define the expected output in terms of one of the widely used DNA sequence formats.

Does this 50-clumping with zero-based index prefixes have a defined provenance? There is a list of standard formats at, for example, the Genomatix site, under the rubric 'DNA Sequence Formats'.

If you are after clumping and index prefixes, then perhaps the 'pretty printing' needs to be tidied into, and defined as, a translation from FASTA or 'plain sequence' to EMBL, GCG, or GenBank. Hout (talk) 14:35, 26 November 2019 (UTC)
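To make the suggestion concrete, here is a minimal sketch of a translation from a plain sequence into a layout modeled on the sequence block of a GenBank flat file: 60 bases per line, in space-separated groups of 10, lowercase, each line prefixed by the 1-based position of its first base, right-aligned in 9 columns. The function name `genbank_origin_lines` and its parameters are hypothetical, and the layout is my reading of the format rather than a checked implementation of the specification.

```python
def genbank_origin_lines(sequence, per_line=60, group=10):
    """Format a plain DNA sequence as GenBank-style sequence lines.

    Assumptions (not taken from the task description): 60 bases per line,
    groups of 10, lowercase output, and a 1-based position prefix
    right-aligned in 9 columns.
    """
    # Drop any whitespace or line breaks present in the plain sequence.
    seq = "".join(sequence.split()).lower()
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # The prefix is the 1-based position of the first base on the line.
        lines.append(f"{start + 1:>9} {' '.join(groups)}")
    return lines


if __name__ == "__main__":
    for line in genbank_origin_lines("ACGT" * 20):
        print(line)
```

Tying the task to a layout like this would replace the ad hoc 50-clumping with something a reader could check against an existing format definition.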

rare or minor bases

I wrote the REXX version to allow other bases, which would also allow the program to catch typos (as they would show up in the counts, whether valid or not).

─────────────────────────── from elsewhere on the 'net:

Common Bases:

   A = Adenine 
   C = Cytosine 
   G = Guanine 
   T = Thymine 

Rare or Minor Bases:

In addition to four common bases (ATGC), certain other unusual bases of purine and pyrimidine derivatives, called rare or minor bases, occur in small amounts in some DNA. In some viruses uracil occurs in place of thymine in DNA.

-- Gerard Schildberger (talk) 22:13, 25 November 2019 (UTC)

The Python just counts the character types too, and doesn't check that they are ACG or T only - easier to code, and enough for this task, I thought. --Paddy3118 (talk) 22:29, 25 November 2019 (UTC)
Yuppers, that way you'd catch typos more easily. I would've made that part of the requirements; after all, some of those DNA strings are ginormous, and if a site is using an optical reader, errors can easily creep in. -- Gerard Schildberger (talk) 22:35, 25 November 2019 (UTC)
To raise this above a reclothing of the "Letter frequency task", it should really, at the very least, maintain a count of unexpected symbols. Hout (talk) 14:35, 26 November 2019 (UTC)
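A minimal sketch of what the comments above describe: tally every symbol, then report the four common bases separately from anything unexpected, so typos show up in the output. The names `EXPECTED_BASES` and `count_bases` are hypothetical, and treating only A, C, G and T as expected (with input folded to upper case and whitespace ignored) is an assumption, not something the task specifies.

```python
from collections import Counter

# Assumption: only the four common bases count as expected symbols.
EXPECTED_BASES = "ACGT"


def count_bases(sequence):
    """Return (counts of expected bases, counts of unexpected symbols)."""
    # Fold case so that 'a' and 'A' are counted as the same base.
    tally = Counter(sequence.upper())
    expected = {base: tally.get(base, 0) for base in EXPECTED_BASES}
    # Anything that is neither an expected base nor whitespace is suspicious.
    unexpected = {
        symbol: n
        for symbol, n in tally.items()
        if symbol not in EXPECTED_BASES and not symbol.isspace()
    }
    return expected, unexpected


if __name__ == "__main__":
    expected, unexpected = count_bases("ACGTTGCAXU acgn")
    print("bases:     ", expected)
    print("total:     ", sum(expected.values()))
    print("unexpected:", unexpected)
```

With this split, a rare base such as U is not rejected; it simply appears in the second tally, where a reader can decide whether it is a minor base or a typo.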

base count

In the task's preamble, task requirement #1 states:

   ...  as well as the total count of bases in the string. 

I could tell just by a quick look at the DNA string. There were (count 'em) four bases in the string. English is fun.

Is there an upcoming #2 bullet point? -- Gerard Schildberger (talk) 00:07, 26 November 2019 (UTC)

... But only one Count Basie.

--Paddy3118 (talk) 19:42, 26 November 2019 (UTC)