FASTA format
In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each identified by a name marked by a “>” character at the beginning of the line.
Write a program that reads a FASTA file such as:
>Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED
And prints the following output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.
AutoHotkey
<lang AutoHotkey>Data = ( >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED )
Data := RegExReplace(RegExReplace(Data, ">\V+\K\v+", ": "), "\v+(?!>)") Gui, add, Edit, w700, % Data Gui, show return</lang>
Outputs:
>Rosetta_Example_1: THERECANBENOSPACE >Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
C++
<lang cpp>#include <iostream>
- include <fstream>
int main( int argc, char **argv ){
if( argc <= 1 ){ std::cerr << "Usage: "<<argv[0]<<" [infile]" << std::endl; return -1; }
std::ifstream input(argv[1]); if(!input.good()){ std::cerr << "Error opening '"<<argv[1]<<"'. Bailing out." << std::endl; return -1; }
std::string line, name, content; while( std::getline( input, line ).good() ){ if( line.empty() || line[0] == '>' ){ // Identifier marker if( !name.empty() ){ // Print out what we read from the last entry std::cout << name << " : " << content << std::endl; name.clear(); } if( !line.empty() ){ name = line.substr(1); } content.clear(); } else if( !name.empty() ){ if( line.find(' ') != std::string::npos ){ // Invalid sequence--no spaces allowed name.clear(); content.clear(); } else { content += line; } } } if( !name.empty() ){ // Print out what we read from the last entry std::cout << name << " : " << content << std::endl; } return 0;
}</lang>
- Output:
Rosetta_Example_1 : THERECANBENOSPACE Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
D
<lang d>import std.stdio, std.string;
void main() {
immutable fileName = "fasta_format_data.fasta";
bool first = true;
foreach (const line; fileName.File.byLine) { if (line[0] == '>') { if (first) { first = false; } else { writeln; }
write(line[1 .. $].strip, ": "); } else { line.strip.write; } }
writeln;
}</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Go
<lang go>package main
import (
"bufio" "fmt" "os"
)
func main() {
f, err := os.Open("rc.fasta") if err != nil { fmt.Println(err) return } defer f.Close() s := bufio.NewScanner(f) headerFound := false for s.Scan() { line := s.Text() switch { case line == "": continue case line[0] != '>': if !headerFound { fmt.Println("missing header") return } fmt.Print(line) case headerFound: fmt.Println() fallthrough default: fmt.Printf("%s: ", line[1:]) headerFound = true } } if headerFound { fmt.Println() } if err := s.Err(); err != nil { fmt.Println(err) }
}</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
J
Needs chunking to handle huge files. <lang j>require 'strings' NB. not needed for J versions greater than 6. parseFasta=: ((': ' ,~ LF&taketo) , (LF -.~ LF&takeafter));._1</lang> Example Usage <lang j> Fafile=: noun define >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED )
parseFasta Fafile
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED</lang>
Java
<lang java>import java.io.*; import java.util.Scanner;
public class ReadFastaFile {
public static void main(String[] args) throws FileNotFoundException {
boolean first = true;
try (Scanner sc = new Scanner(new File("test.fasta"))) { while (sc.hasNextLine()) { String line = sc.nextLine().trim(); if (line.charAt(0) == '>') { if (first) first = false; else System.out.println(); System.out.printf("%s: ", line.substring(1)); } else { System.out.print(line); } } } System.out.println(); }
}</lang>
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED Rosetta_Example_3: THISISFASTA
Perl 6
<lang perl6>grammar FASTA {
rule TOP { <entry>+ } rule entry { \> <title> <sequence> } token title { <.alnum>+ } token sequence { ( <.alnum>+ )+ % \n { make $0.join } }
}
FASTA.parse: q:to /§/; >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED §
for $/<entry>[] {
say ~.<title>, " : ", .<sequence>.made;
}</lang>
- Output:
Rosetta_Example_1 : THERECANBENOSPACE Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Python
I use a string to mimic an input file. If it was an input file, then the file is read line-by-line and I use a generator expression yielding key, value pairs as soon as they are read keeping the minimum in memory. <lang python>import io
FASTA=\ >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED
infile = io.StringIO(FASTA)
def fasta_parse(infile):
key = for line in infile: if line.startswith('>'): if key: yield key, val key, val = line[1:].rstrip().split()[0], elif key: val += line.rstrip() if key: yield key, val
print('\n'.join('%s: %s' % keyval for keyval in fasta_parse(infile)))</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Racket
<lang racket>
- lang racket
(let loop ([m #t])
(when m (when (regexp-try-match #rx"^>" (current-input-port)) (unless (eq? #t m) (newline)) (printf "~a: " (read-line))) (loop (regexp-match #rx"\n" (current-input-port) 0 #f (current-output-port)))))
(newline) </lang>
REXX
version 1
This REXX version correctly processes the examples shown. <lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */
do while lines(iFID)\==0 /*process the FASTA file contents*/ x=strip(linein(iFID), 'T') /*read a line (record) from file,*/ /* and strip trailing blanks.*/ if left(x,1)=='>' then do if $\== then say name':' $ name=substr(x,2) $= end else $=$||x end /*j*/
if $\== then say name':' $
/*stick a fork in it, we're done.*/</lang>
output when using the default input file
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
version 2
This REXX version handles (see the talk page):
- blank lines
- sequences that end in an asterisk [*]
- sequences that contain blanks, tabs, and other whitespace
- sequence names that are identified with a semicolon [;]
<lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */
do while lines(iFID)\==0 /*process the FASTA file contents*/ x=strip(linein(iFID), 'T') /*read a line (record) from file,*/ /* and strip trailing blanks.*/ if x== then iterate /*ignore blank lines. */ if left(x,1)==';' then do if name== then name=substr(x,2) say x iterate end if left(x,1)=='>' then do if $\== then say name':' $ name=substr(x,2) $= end else $=space($||translate(x,,'*'),0) end /*j*/
if $\== then say name':' $
/*stick a fork in it, we're done.*/</lang>
input The FASTA2.IN file is shown below:
;LCBO - Prolactin precursor - Bovine ; a sample sequence in FASTA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC* >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK* >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY
output when the FASTA2.IN input file is used:
;LCBO - Prolactin precursor - Bovine ; a sample sequence in FASTA format LCBO - Prolactin precursor - Bovine: MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken: ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
Ruby
<lang ruby>def fasta_format(strings)
out, text = [], "" strings.split("\n").each do |line| if line[0] == '>' out << text unless text.empty? text = line[1..-1] + ": " else text << line end end out << text unless text.empty?
end
data = <<'EOS' >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED EOS
puts fasta_format(data)</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Run BASIC
<lang runbasic>a$ = ">Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED"
i = 1 while i <= len(a$)
if mid$(a$,i,17) = ">Rosetta_Example_" then print print mid$(a$,i,18);": "; i = i + 17 else if asc(mid$(a$,i,1)) > 20 then print mid$(a$,i,1); end if i = i + 1
wend</lang>
- Output:
>Rosetta_Example_1: THERECANBENOSPACE >Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Tcl
<lang tcl>proc fastaReader {filename} {
set f [open $filename] set sep "" while {[gets $f line] >= 0} {
if {[string match >* $line]} { puts -nonewline "$sep[string range $line 1 end]: " set sep "\n" } else { puts -nonewline $line }
} puts "" close $f
}
fastaReader ./rosettacode.fas</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
zkl
<lang zkl>fcn fasta(data){ // a lazy cruise through a FASTA file
fcn(w){ // one string at a time, -->False garbage at front of file line:=w.next().strip(); if(line[0]==">") w.pump(line[1,*]+": ",'wrap(l){ if(l[0]==">") { w.push(l); Void.Stop } else l.strip() }) }.fp(data.walker()) : Utils.Helpers.wap(_);
}</lang>
- This assumes that white space at front or end of string is extraneous (excepting ">" lines).
- Lazy, works for objects that support iterating over lines (ie most).
- The fasta function returns an iterator that wraps a function taking an iterator. Uh, yeah. An initial iterator (Walker) is used to get lines, hold state and do push back when read the start of the next string. The function sucks up one string (using the iterator). The wrapping iterator (wap) traps the exception when the function waltzes off the end of the data and provides API for foreach (etc).
FASTA file: <lang zkl>foreach l in (fasta(File("fasta.txt"))) { println(l) }</lang> FASTA data blob: <lang zkl>data:=Data(0,String,
">Rosetta_Example_1\nTHERECANBENOSPACE\n" ">Rosetta_Example_2\nTHERECANBESEVERAL\nLINESBUTTHEYALLMUST\n" "BECONCATENATED");
foreach l in (fasta(data)) { println(l) }</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED