FASTA format: Difference between revisions
Revision as of 03:42, 8 May 2013
In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each identified by a name marked by a “>” character at the beginning of the line.
Write a program that reads a FASTA file such as:
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED
And prints the following output:
Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.
D
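<lang d>import std.exception;
import std.file;
import std.stdio;

void main(string[] args) {
    if (args.length < 2) {
        throw new Exception("You must specify a file.");
    }
    enforce(exists(args[1]));

    // Becomes true once the first header line has been printed.
    bool headerSeen = false;
    foreach (line; File(args[1]).byLine) {
        // Check the length first so that an empty line is not indexed.
        if (line.length > 0 && line[0] == '>') {
            if (headerSeen) {
                writeln();          // terminate the previous record
            } else {
                headerSeen = true;
            }
            write(line[1..$] ~ ": ");
        } else {
            write(line);            // sequence data: append to the current record
        }
    }
    writeln();
}</lang>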
Perl 6
Certainly not the most elegant way to do it, but that's a start:
<lang Perl 6>say "{.[0]}: {.[1]>>.chomp.join}" for
">Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED".comb: / '>' (\N+)\n (<!before '>'>\N+\n?)+ /, :match</lang>
Python
I use a string to mimic an input file. With a real file, the lines are read one at a time, and a generator function yields each key, value pair as soon as it is complete, so only the current record is kept in memory. The key is the first whitespace-delimited word of the header line.
<lang python>import io

FASTA = '''\
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED'''

infile = io.StringIO(FASTA)

def fasta_parse(infile):
    key = ''
    for line in infile:
        if line.startswith('>'):
            # A new header: emit the record collected so far, if any.
            if key:
                yield key, val
            key, val = line[1:].rstrip().split()[0], ''
        elif key:
            # Sequence data: append it to the current record.
            val += line.rstrip()
    if key:
        yield key, val

print('\n'.join('%s: %s' % keyval for keyval in fasta_parse(infile)))</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
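The task notes that real FASTA files can be multiple gigabytes in size, and a single sequence may itself be very long. The following is a minimal sketch, not part of the original entry, of a variant that writes each piece of a sequence as soon as it is read, in the same spirit as the D and Tcl entries. The function name `fasta_stream` and its `out` parameter are hypothetical, chosen for illustration. Unlike the generator above, it keeps the whole header line as the name.

```python
import io
import sys

def fasta_stream(infile, out=sys.stdout):
    """Write 'name: sequence' lines to out, holding only one input line at a time."""
    started = False                  # True once the first header has been written
    for line in infile:
        line = line.rstrip()
        if line.startswith('>'):
            if started:
                out.write('\n')      # terminate the previous record
            out.write(line[1:] + ': ')
            started = True
        elif started:
            out.write(line)          # sequence data goes straight to the output
    if started:
        out.write('\n')

# A string stands in for a file; an open file object would work the same way.
sample = io.StringIO(
    ">Rosetta_Example_1\n"
    "THERECANBENOSPACE\n"
    ">Rosetta_Example_2\n"
    "THERECANBESEVERAL\n"
    "LINESBUTTHEYALLMUST\n"
    "BECONCATENATED\n"
)
fasta_stream(sample)
```

The trade-off is that records are no longer available as key, value pairs; this form only suits the case where the output is written in input order.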
REXX
version 1
This REXX version correctly processes the examples shown.
<lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/
parse arg iFID _ .                     /*iFID = input file to be read.  */
if iFID==''  then iFID='FASTA.IN'      /*Not specified? Use the default */
$=;  name=                             /*default values (so far).       */

  do  while  lines(iFID)\==0           /*process the FASTA file contents*/
  x=strip(linein(iFID), 'T')           /*read a line (record) from file,*/
                                       /*  and strip trailing blanks.   */
  if left(x,1)=='>'  then do
                          if $\==''  then say name':'  $
                          name=substr(x,2)
                          $=
                          end
                     else $=$||x
  end   /*while*/

if $\==''  then say name':'  $
                                       /*stick a fork in it, we're done.*/</lang>
output when using the default input file
Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
version 2
This REXX version handles (see the talk page):
- blank lines
- sequences that end in an asterisk [*]
- sequences that contain blanks, tabs, and other whitespace
- sequence names that are identified with a semicolon [;]
<lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/
parse arg iFID _ .                     /*iFID = input file to be read.  */
if iFID==''  then iFID='FASTA.IN'      /*Not specified? Use the default */
$=;  name=                             /*default values (so far).       */

  do  while  lines(iFID)\==0           /*process the FASTA file contents*/
  x=strip(linein(iFID), 'T')           /*read a line (record) from file,*/
                                       /*  and strip trailing blanks.   */
  if x==''  then iterate               /*ignore blank lines.            */
  if left(x,1)==';'  then do
                          if name==''  then name=substr(x,2)
                          say x
                          iterate
                          end
  if left(x,1)=='>'  then do
                          if $\==''  then say name':'  $
                          name=substr(x,2)
                          $=
                          end
                     else $=space($||translate(x,,'*'),0)
  end   /*while*/

if $\==''  then say name':'  $
                                       /*stick a fork in it, we're done.*/</lang>
input: the FASTA2.IN file is shown below:
;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
output when the FASTA2.IN input file is used:
;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
LCBO - Prolactin precursor - Bovine: MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC
MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken: ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK
gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
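For comparison, the extra rules that REXX version 2 handles (blank lines, a trailing asterisk, whitespace inside sequences, and names introduced by a semicolon) can be sketched in Python. This is an illustrative sketch under stated assumptions, not a port of the REXX program: the name `fasta_parse_extended` is hypothetical, semicolon lines are not echoed to the output, and only the first semicolon line seen while no name is set is taken as the record name.

```python
import io

def fasta_parse_extended(infile):
    """Yield (name, sequence) pairs, tolerating common FASTA variations."""
    name, parts = '', []
    for line in infile:
        line = line.strip()
        if not line:                      # ignore blank lines
            continue
        if line.startswith(';'):          # semicolon header or comment line
            if not name:                  # assumption: first one names the record
                name = line[1:]
            continue
        if line.startswith('>'):
            if parts:                     # emit the record collected so far
                yield name, ''.join(parts)
            name, parts = line[1:], []
        else:
            # Turn asterisks into blanks, then drop all whitespace.
            parts.append(''.join(line.replace('*', ' ').split()))
    if parts:
        yield name, ''.join(parts)

# A short made-up input that exercises each rule.
sample = io.StringIO(
    ";LCBO - Prolactin\n"
    "; a comment\n"
    "MDS KG\n"
    "\n"
    "NNC*\n"
    ">MCHU - Calmodulin\n"
    "ADQ\tLT\n"
    "AK*\n"
)
for name, seq in fasta_parse_extended(sample):
    print('%s: %s' % (name, seq))
```

Each sequence is collected in a list and joined once per record, which avoids repeated string concatenation but still holds one full record in memory at a time.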
Run BASIC
<lang runbasic>a$ = ">Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED"

i = 1
while i <= len(a$)
  if mid$(a$,i,17) = ">Rosetta_Example_" then
    print
    print mid$(a$,i,18);": ";
    i = i + 17
  else
    if asc(mid$(a$,i,1)) > 20 then print mid$(a$,i,1);
  end if
  i = i + 1
wend</lang>
- Output:
>Rosetta_Example_1: THERECANBENOSPACE
>Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Tcl
<lang tcl>proc fastaReader {filename} {
    set f [open $filename]
    set sep ""
    while {[gets $f line] >= 0} {
        if {[string match >* $line]} {
            puts -nonewline "$sep[string range $line 1 end]: "
            set sep "\n"
        } else {
            puts -nonewline $line
        }
    }
    puts ""
    close $f
}

fastaReader ./rosettacode.fas</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED