FASTA format: Difference between revisions

Content added Content deleted

Inline

Revision as of 23:23, 17 May 2015

In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each identified by a name marked by a “>” character at the beginning of the line.

Write a program that reads a FASTA file such as:

>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.

AutoHotkey

<lang AutoHotkey>Data = ( >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED )

Data := RegExReplace(RegExReplace(Data, ">\V+\K\v+", ": "), "\v+(?!>)") Gui, add, Edit, w700, % Data Gui, show return</lang>

Output:

>Rosetta_Example_1: THERECANBENOSPACE
>Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

AWK

syntax: GAWK -f FASTA_FORMAT.AWK filename
stop processing each file when an error is encountered

{ if (FNR == 1) {

     header_found = 0
     if ($0 !~ /^[;>]/) {
       error("record is not valid")
       nextfile
     }
   }
   if ($0 ~ /^;/) { next } # comment begins with a ";"
   if ($0 ~ /^>/) { # header
     if (header_found > 0) {
       printf("\n") # EOL for previous sequence
     }
     printf("%s: ",substr($0,2))
     header_found = 1
     next
   }
   if ($0 ~ /[ \t]/) { next } # ignore records with whitespace
   if ($0 ~ /\*$/) { # sequence may end with an "*"
     if (header_found > 0) {
       printf("%s\n",substr($0,1,length($0)-1))
       header_found = 0
       next
     }
     else {
       error("end of sequence found but header is missing")
       nextfile
     }
   }
   if (header_found > 0) {
     printf("%s",$0)
   }
   else {
     error("header not found")
     nextfile
   }

} ENDFILE {

   if (header_found > 0) {
     printf("\n")
   }

} END {

   exit (errors == 0) ? 0 : 1

} function error(message) {

   printf("error: FILENAME=%s, FNR=%d, %s, %s\n",FILENAME,FNR,message,$0) >"con"
   errors++
   return

} </lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

C++

<lang cpp>#include <iostream>

include <fstream>

int main( int argc, char **argv ){

   if( argc <= 1 ){
       std::cerr << "Usage: "<<argv[0]<<" [infile]" << std::endl;
       return -1;
   }

   std::ifstream input(argv[1]);
   if(!input.good()){
       std::cerr << "Error opening '"<<argv[1]<<"'. Bailing out." << std::endl;
       return -1;
   }

   std::string line, name, content;
   while( std::getline( input, line ).good() ){
       if( line.empty() || line[0] == '>' ){ // Identifier marker
           if( !name.empty() ){ // Print out what we read from the last entry
               std::cout << name << " : " << content << std::endl;
               name.clear();
           }
           if( !line.empty() ){
               name = line.substr(1);
           }
           content.clear();
       } else if( !name.empty() ){
           if( line.find(' ') != std::string::npos ){ // Invalid sequence--no spaces allowed
               name.clear();
               content.clear();
           } else {
               content += line;
           }
       }
   }
   if( !name.empty() ){ // Print out what we read from the last entry
       std::cout << name << " : " << content << std::endl;
   }
   
   return 0;

}</lang>

Output:

Rosetta_Example_1 : THERECANBENOSPACE
Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

D

<lang d>import std.stdio, std.string;

void main() {

   immutable fileName = "fasta_format_data.fasta";

   bool first = true;

   foreach (const line; fileName.File.byLine) {
       if (line[0] == '>') {
           if (first) {
               first = false;
           } else {
               writeln;
           }

           write(line[1 .. $].strip, ": ");
       } else {
           line.strip.write;
       }
   }

   writeln;

}</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Go

<lang go>package main

import (

       "bufio"
       "fmt"
       "os"

)

func main() {

       f, err := os.Open("rc.fasta")
       if err != nil {
               fmt.Println(err)
               return
       }
       defer f.Close()
       s := bufio.NewScanner(f)
       headerFound := false
       for s.Scan() {
               line := s.Text()
               switch {
               case line == "":
                       continue
               case line[0] != '>':
                       if !headerFound {
                               fmt.Println("missing header")
                               return
                       }
                       fmt.Print(line)
               case headerFound:
                       fmt.Println()
                       fallthrough
               default:
                       fmt.Printf("%s: ", line[1:])
                       headerFound = true
               }
       }
       if headerFound {
               fmt.Println()
       }
       if err := s.Err(); err != nil {
               fmt.Println(err)
       }

}</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

J

Needs chunking to handle huge files. <lang j>require 'strings' NB. not needed for J versions greater than 6. parseFasta=: ((': ' ,~ LF&taketo) , (LF -.~ LF&takeafter));._1</lang> Example Usage <lang j> Fafile=: noun define >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED )

  parseFasta Fafile

Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED</lang>

Java

Translation of: D

Works with: Java version 7

<lang java>import java.io.*; import java.util.Scanner;

public class ReadFastaFile {

   public static void main(String[] args) throws FileNotFoundException {

       boolean first = true;

       try (Scanner sc = new Scanner(new File("test.fasta"))) {
           while (sc.hasNextLine()) {
               String line = sc.nextLine().trim();
               if (line.charAt(0) == '>') {
                   if (first)
                       first = false;
                   else
                       System.out.println();
                   System.out.printf("%s: ", line.substring(1));
               } else {
                   System.out.print(line);
               }
           }
       }
       System.out.println();
   }

}</lang>

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Rosetta_Example_3: THISISFASTA

jq

Works with: jq version 1.5rc1

The following implementation uses "foreach" and "inputs" so that very large input files can be processed with minimal space requirements: in each cycle, only as many lines are read as are required to compose an output line.
Notice that an additional ">" must be provided to "foreach" to ensure the final block of lines of the input file are properly assembled. <lang jq> def fasta:

 foreach (inputs, ">") as $line
   # state: [accumulator, print ]
   ( [null, null];
     if $line[0:1] == ">" then [($line[1:] + ": "), .[0]]
     else [ (.[0] + $line), false]
     end;
     if .[1] then .[1] else empty end )
   ;

fasta</lang>

Output:

<lang sh>$ jq -n -R -r -f FASTA_format.jq < FASTA_format.fasta Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED</lang>

Perl 6

<lang perl6>grammar FASTA {

   rule TOP    { <entry>+ }
   rule entry  { \> <title> <sequence> }
   token title { <.alnum>+ }
   token sequence { ( <.alnum>+ )+ % \n { make $0.join } }

}

FASTA.parse: q:to /§/; >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED §

for $/<entry>[] {

   say ~.<title>, " : ", .<sequence>.made;

}</lang>

Output:

Rosetta_Example_1 : THERECANBENOSPACE
Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Python

I use a string to mimic an input file. If it was an input file, then the file is read line-by-line and I use a generator expression yielding key, value pairs as soon as they are read, keeping the minimum in memory. <lang python>import io

FASTA=\ >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED

infile = io.StringIO(FASTA)

def fasta_parse(infile):

   key = 
   for line in infile:
       if line.startswith('>'):
           if key:
               yield key, val
           key, val = line[1:].rstrip().split()[0], 
       elif key:
           val += line.rstrip()
   if key:
       yield key, val

print('\n'.join('%s: %s' % keyval for keyval in fasta_parse(infile)))</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Racket

lang racket

(let loop ([m #t])

 (when m
   (when (regexp-try-match #rx"^>" (current-input-port))
     (unless (eq? #t m) (newline))
     (printf "~a: " (read-line)))
   (loop (regexp-match #rx"\n" (current-input-port) 0 #f
                       (current-output-port)))))

(newline) </lang>

REXX

version 1

This REXX version correctly processes the examples shown. <lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */

  do  while  lines(iFID)\==0          /*process the FASTA file contents*/
  x=strip(linein(iFID), 'T')          /*read a line (record) from file,*/
                                      /*     and strip trailing blanks.*/
  if left(x,1)=='>'  then do
                          if $\==  then say name':'  $
                          name=substr(x,2)
                          $=
                          end
                     else $=$||x
  end   /*j*/

if $\== then say name':' $

                                      /*stick a fork in it, we're done.*/</lang>

Output:

when using the default input file

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

version 2

This REXX version handles (see the talk page):

blank lines
sequences that end in an asterisk [*]
sequences that contain blanks, tabs, and other whitespace
sequence names that are identified with a semicolon [;]

<lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */

  do  while  lines(iFID)\==0          /*process the FASTA file contents*/
  x=strip(linein(iFID), 'T')          /*read a line (record) from file,*/
                                      /*     and strip trailing blanks.*/
  if x==  then iterate              /*ignore blank lines.            */
  if left(x,1)==';'  then do
                          if name== then name=substr(x,2)
                          say x
                          iterate
                          end
  if left(x,1)=='>'  then do
                          if $\==  then say name':'  $
                          name=substr(x,2)
                          $=
                          end
                     else $=space($||translate(x,,'*'),0)
  end   /*j*/

if $\== then say name':' $

                                      /*stick a fork in it, we're done.*/</lang>

input The FASTA2.IN file is shown below:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Output:

when the FASTA2.IN input file is used

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
LCBO - Prolactin precursor - Bovine: MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC
MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken: ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK
gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY

Ruby

<lang ruby>def fasta_format(strings)

 out, text = [], ""
 strings.split("\n").each do |line|
   if line[0] == '>'
     out << text unless text.empty?
     text = line[1..-1] + ": "
   else
     text << line
   end
 end
 out << text unless text.empty?

end

data = <<'EOS' >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED EOS

puts fasta_format(data)</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Run BASIC

<lang runbasic>a$ = ">Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED"

i = 1 while i <= len(a$)

 if mid$(a$,i,17) = ">Rosetta_Example_" then 
   print 
   print mid$(a$,i,18);": ";
   i = i + 17
  else
   if asc(mid$(a$,i,1)) > 20 then print mid$(a$,i,1);
 end if
 i = i + 1

wend</lang>

Output:

>Rosetta_Example_1: THERECANBENOSPACE
>Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Tcl

<lang tcl>proc fastaReader {filename} {

   set f [open $filename]
   set sep ""
   while {[gets $f line] >= 0} {

if {[string match >* $line]} { puts -nonewline "$sep[string range $line 1 end]: " set sep "\n" } else { puts -nonewline $line }

   }
   puts ""
   close $f

}

fastaReader ./rosettacode.fas</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

zkl

<lang zkl>fcn fasta(data){ // a lazy cruise through a FASTA file

  fcn(w){      // one string at a time, -->False garbage at front of file
     line:=w.next().strip();
     if(line[0]==">") w.pump(line[1,*]+": ",'wrap(l){
        if(l[0]==">") { w.push(l); Void.Stop } else l.strip()
     })
  }.fp(data.walker()) : Utils.Helpers.wap(_);

}</lang>

This assumes that white space at front or end of string is extraneous (excepting ">" lines).
Lazy, works for objects that support iterating over lines (ie most).
The fasta function returns an iterator that wraps a function taking an iterator. Uh, yeah. An initial iterator (Walker) is used to get lines, hold state and do push back when read the start of the next string. The function sucks up one string (using the iterator). The wrapping iterator (wap) traps the exception when the function waltzes off the end of the data and provides API for foreach (etc).

FASTA file: <lang zkl>foreach l in (fasta(File("fasta.txt"))) { println(l) }</lang> FASTA data blob: <lang zkl>data:=Data(0,String,

  ">Rosetta_Example_1\nTHERECANBENOSPACE\n"
  ">Rosetta_Example_2\nTHERECANBESEVERAL\nLINESBUTTHEYALLMUST\n"
    "BECONCATENATED");

foreach l in (fasta(data)) { println(l) }</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

@@ Line 1: / Line 1: @@
 {{draft task}}
-In [[wp:bioinformatics|bioinformatics]], long character strings are often encoded in a format called [[wp:FASTA format|FASTA]].  A FASTA file can contain several strings, each identified by a name marked by a “<tt>&gt;</tt>” character at the beginning of the line.
+In [[wp:bioinformatics|bioinformatics]], long character strings are often encoded in a format called [[wp:FASTA format|FASTA]].
+A FASTA file can contain several strings, each identified by a name marked by a “<tt>&gt;</tt>” character at the beginning of the line.
 Write a program that reads a FASTA file such as:
@@ Line 11: / Line 12: @@
 BECONCATENATED</pre>
+{{out}}
-And prints the following output:
 <pre>Rosetta_Example_1: THERECANBENOSPACE
 Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED</pre>
@@ Line 33: / Line 33: @@
 Gui, show
 return</lang>
+{{out}}
-Outputs:<pre>>Rosetta_Example_1: THERECANBENOSPACE
+<pre>>Rosetta_Example_1: THERECANBENOSPACE
 >Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED</pre>
 =={{header|AWK}}==
@@ Line 89: / Line 90: @@
 }
 </lang>
+{{out}}
-<p>Output:</p>
 <pre>
 Rosetta_Example_1: THERECANBENOSPACE
@@ Line 276: / Line 277: @@
 The following implementation uses "foreach" and "inputs"
 so that very large input files can be processed with minimal space requirements:
-in each cycle, only as many lines are read as are required to compose an
+in each cycle, only as many lines are read as are required to compose an output line. <br>
-output line.  Notice that an additional ">" must be provided to "foreach"
+Notice that an additional ">" must be provided to "foreach" to ensure the final block of lines of the input file are properly assembled.
-to ensure the final block of lines of the input file are properly assembled.
 <lang jq>
 def fasta:
@@ Line 323: / Line 323: @@
 =={{header|Python}}==
+I use a string to mimic an input file.
-I use a string to mimic an input file. If it was an input file, then the file is read line-by-line and I use a generator expression yielding key, value pairs as soon as they are read keeping the minimum in memory.
+If it was an input file, then the file is read line-by-line
+and I use a generator expression yielding key, value pairs
+as soon as they are read, keeping the minimum in memory.
 <lang python>import io
@@ Line 388: / Line 391: @@
 if $\==''  then say name':'  $
                                        /*stick a fork in it, we're done.*/</lang>
-'''output''' when using the default input file
+{{out}} when using the default input file
 <pre>
 Rosetta_Example_1: THERECANBENOSPACE
 Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
 </pre>
 ===version 2===
 This REXX version handles (see the ''talk'' page):
@@ Line 444: / Line 448: @@
 IENY
 </pre>
-'''output''' &nbsp; when the <tt> FASTA2.IN </tt> input file is used:
+{{out}} when the <tt> FASTA2.IN </tt> input file is used:
 <pre style="overflow:scroll">
 ;LCBO - Prolactin precursor - Bovine