FASTA format
In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each identified by a name marked by a “>” character at the beginning of the line.
Write a program that reads a FASTA file such as:
>Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED
And prints the following output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.
D
<lang d> import std.exception; import std.file; import std.stdio;
void main(string[] args) {
if (args.length < 2) { throw new Exception("You must specify a file."); } enforce(exists(args[1])); bool firstLine = false; foreach (line; File(args[1]).byLine) { if (line[0] == '>') { if (firstLine) { writeln(); } else { firstLine = true; } write(line[1..$] ~ ": "); } else { write(line); } } writeln();
} </lang>
Perl 6
Certainly not the most elegant way to do it, but that's a start: <lang Perl 6>say "{.[0]}: {.[1]>>.comb(/\N+/).join}" for ">Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED".comb: / '>' (\N+)\n (<!before '>'>\N+\n?)+ /, :match</lang>
REXX
version 1
This REXX version correctly processes the examples shown. <lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */
do while lines(iFID)\==0 /*process the FASTA file contents*/ x=strip(linein(iFID), 'T') /*read a line (record) from file,*/ /* and strip trailing blanks.*/ if left(x,1)=='>' then do if $\== then say name':' $ name=substr(x,2) $= end else $=$||x end /*j*/
if $\== then say name':' $
/*stick a fork in it, we're done.*/</lang>
output when using the default input file
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
version 2
This REXX version handles (see the talk page):
- blank lines
- sequences that end in an asterisk [*]
- sequences that contain blanks, tabs, and other whitespace
- sequence names that are identified with a semicolon [;]
<lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */
do while lines(iFID)\==0 /*process the FASTA file contents*/ x=strip(linein(iFID), 'T') /*read a line (record) from file,*/ /* and strip trailing blanks.*/ if x== then iterate /*ignore blank lines. */ if left(x,1)==';' then do if name== then name=substr(x,2) say x iterate end if left(x,1)=='>' then do if $\== then say name':' $ name=substr(x,2) $= end else $=space($||translate(x,,'*'),0) end /*j*/
if $\== then say name':' $
/*stick a fork in it, we're done.*/</lang>
input The FASTA2.IN file is shown below:
;LCBO - Prolactin precursor - Bovine ; a sample sequence in FASTA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC* >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK* >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY
output when the FASTA2.IN input file is used:
;LCBO - Prolactin precursor - Bovine ; a sample sequence in FASTA format LCBO - Prolactin precursor - Bovine: MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken: ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
Tcl
<lang tcl>proc fastaReader {filename} {
set f [open $filename] set sep "" while {[gets $f line] >= 0} {
if {[string match >* $line]} { puts -nonewline "$sep[string range $line 1 end]: " set sep "\n" } else { puts -nonewline $line }
} puts "" close $f
}
fastaReader ./rosettacode.fas</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED