FASTA format: Difference between revisions

Revision as of 22:25, 16 April 2014

In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each identified by a name marked by a “>” character at the beginning of the line.

Write a program that reads a FASTA file such as:

>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED

And prints the following output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.
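As a rough illustration of the memory note above, here is a minimal Python sketch that writes each sequence chunk as soon as it is read, so only the current line is held in memory. The function name `stream_fasta` and its `out` parameter are hypothetical names chosen for this example, and the sketch assumes that lines appearing before the first header can simply be ignored.

```python
import io
import sys

def stream_fasta(lines, out=sys.stdout):
    """Write 'name: sequence' records to out, one input line at a time.

    Sequence chunks are written immediately instead of being accumulated,
    so memory use does not grow with the size of a sequence or the file.
    """
    in_record = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if in_record:
                out.write("\n")          # terminate the previous record
            out.write(line[1:] + ": ")   # header without the '>' marker
            in_record = True
        elif in_record:
            out.write(line)              # emit this chunk right away
    if in_record:
        out.write("\n")                  # terminate the last record

if __name__ == "__main__":
    sample = io.StringIO(
        ">Rosetta_Example_1\n"
        "THERECANBENOSPACE\n"
        ">Rosetta_Example_2\n"
        "THERECANBESEVERAL\n"
        "LINESBUTTHEYALLMUST\n"
        "BECONCATENATED\n"
    )
    stream_fasta(sample)
```

For a real file, an open file object can be passed in place of the `io.StringIO` sample, since iterating over a file also yields one line at a time.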

AutoHotkey

<lang AutoHotkey>Data =
(
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED
)

Data := RegExReplace(RegExReplace(Data, ">\V+\K\v+", ": "), "\v+(?!>)")
Gui, add, Edit, w700, % Data
Gui, show
return</lang>

Outputs:

>Rosetta_Example_1: THERECANBENOSPACE
>Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

C++

<lang cpp>#include <iostream>
#include <fstream>
#include <string>

int main( int argc, char **argv ){

   if( argc <= 1 ){
       std::cerr << "Usage: "<<argv[0]<<" [infile]" << std::endl;
       return -1;
   }

   std::ifstream input(argv[1]);
   if(!input.good()){
       std::cerr << "Error opening '"<<argv[1]<<"'. Bailing out." << std::endl;
       return -1;
   }

   std::string line, name, content;
   while( std::getline( input, line ) ){ // also accepts a final line with no trailing newline
       if( line.empty() || line[0] == '>' ){ // Identifier marker
           if( !name.empty() ){ // Print out what we read from the last entry
               std::cout << name << " : " << content << std::endl;
               name.clear();
           }
           if( !line.empty() ){
               name = line.substr(1);
           }
           content.clear();
       } else if( !name.empty() ){
           if( line.find(' ') != std::string::npos ){ // Invalid sequence--no spaces allowed
               name.clear();
               content.clear();
           } else {
               content += line;
           }
       }
   }
   if( !name.empty() ){ // Print out what we read from the last entry
       std::cout << name << " : " << content << std::endl;
   }
   
   return 0;

}</lang>

Output:

Rosetta_Example_1 : THERECANBENOSPACE
Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

D

<lang d>import std.stdio, std.string;

void main() {

   immutable fileName = "fasta_format_data.fasta";

   bool first = true;

   foreach (const line; fileName.File.byLine) {
       if (line.length && line[0] == '>') { // guard against empty lines
           if (first) {
               first = false;
           } else {
               writeln;
           }

           write(line[1 .. $].strip, ": ");
       } else {
           line.strip.write;
       }
   }

   writeln;

}</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Go

<lang go>package main

import (

       "bufio"
       "fmt"
       "os"

)

func main() {

       f, err := os.Open("rc.fasta")
       if err != nil {
               fmt.Println(err)
               return
       }
       defer f.Close()
       s := bufio.NewScanner(f)
       headerFound := false
       for s.Scan() {
               line := s.Text()
               switch {
               case line == "":
                       continue
               case line[0] != '>':
                       if !headerFound {
                               fmt.Println("missing header")
                               return
                       }
                       fmt.Print(line)
               case headerFound:
                       fmt.Println()
                       fallthrough
               default:
                       fmt.Printf("%s: ", line[1:])
                       headerFound = true
               }
       }
       if headerFound {
               fmt.Println()
       }
       if err := s.Err(); err != nil {
               fmt.Println(err)
       }

}</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

J

Needs chunking to handle huge files.
<lang j>require 'strings'  NB. not needed for J versions greater than 6.
parseFasta=: ((': ' ,~ LF&taketo) , (LF -.~ LF&takeafter));._1</lang>
Example Usage
<lang j>   Fafile=: noun define
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED
)
   parseFasta Fafile
Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED</lang>

Perl 6

Certainly not the most elegant way to do it, but it's a start:
<lang perl6>say "{.[0]}: {.[1]>>.chomp.join}" for
">Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED".comb: / '>' (\N+)\n (<!before '>'>\N+\n?)+ /, :match</lang>

Python

A string is used here to mimic an input file. With a real file, the input would be read line by line in the same way; the generator function yields each key, value pair as soon as it has been read, keeping the minimum in memory.
<lang python>import io

FASTA = '''\
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED'''

infile = io.StringIO(FASTA)

def fasta_parse(infile):
    key = ''
    for line in infile:
        if line.startswith('>'):
            if key:
                yield key, val
            key, val = line[1:].rstrip().split()[0], ''
        elif key:
            val += line.rstrip()
    if key:
        yield key, val

print('\n'.join('%s: %s' % keyval for keyval in fasta_parse(infile)))</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Racket

<lang racket>#lang racket

(let loop ([m #t])

 (when m
   (when (regexp-try-match #rx"^>" (current-input-port))
     (unless (eq? #t m) (newline))
     (printf "~a: " (read-line)))
   (loop (regexp-match #rx"\n" (current-input-port) 0 #f
                       (current-output-port)))))

(newline) </lang>

REXX

version 1

This REXX version correctly processes the examples shown. <lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */

  do  while  lines(iFID)\==0          /*process the FASTA file contents*/
  x=strip(linein(iFID), 'T')          /*read a line (record) from file,*/
                                      /*     and strip trailing blanks.*/
  if left(x,1)=='>'  then do
                          if $\==  then say name':'  $
                          name=substr(x,2)
                          $=
                          end
                     else $=$||x
  end   /*j*/

if $\== then say name':' $

                                      /*stick a fork in it, we're done.*/</lang>

output when using the default input file:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

version 2

This REXX version handles (see the talk page):

blank lines
sequences that end in an asterisk [*]
sequences that contain blanks, tabs, and other whitespace
sequence names that are identified with a semicolon [;]

<lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/
parse arg iFID _ .                     /*iFID = input file to be read.*/
if iFID==''  then iFID='FASTA.IN'      /*Not specified? Use the default*/
$=;     name=                          /*default values (so far).      */

   do  while  lines(iFID)\==0          /*process the FASTA file contents*/
   x=strip(linein(iFID), 'T')          /*read a line (record) from file,*/
                                       /*     and strip trailing blanks.*/
   if x==''  then iterate              /*ignore blank lines.            */
   if left(x,1)==';'  then do
                           if name=='' then name=substr(x,2)
                           say x
                           iterate
                           end
   if left(x,1)=='>'  then do
                           if $\==''  then say name':'  $
                           name=substr(x,2)
                           $=
                           end
                      else $=space($||translate(x,,'*'),0)
   end   /*j*/

if $\==''  then say name':'  $
                                       /*stick a fork in it, we're done.*/</lang>

input: the FASTA2.IN file is shown below:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

output when the FASTA2.IN input file is used:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
LCBO - Prolactin precursor - Bovine: MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC
MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken: ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK
gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
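
The extra rules listed for REXX version 2 (blank lines, a trailing asterisk, embedded whitespace, and names introduced by a semicolon) can be sketched in Python roughly as follows. This is an illustrative approximation rather than a translation: the name `parse_fasta_lenient` is hypothetical, and unlike the REXX program this sketch does not echo the semicolon lines to the output.

```python
def parse_fasta_lenient(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    Assumed rules, modelled on the list above:
      * blank lines are skipped
      * a ';' line names the record only if no name has been seen yet
      * '*' and all whitespace are removed from sequence lines
    """
    name, seq = "", []
    for raw in lines:
        line = raw.rstrip()              # drop trailing blanks and newline
        if not line:
            continue                     # ignore blank lines
        if line.startswith(";"):
            if not name:
                name = line[1:]          # first ';' line supplies the name
            continue
        if line.startswith(">"):
            if seq:
                yield name, "".join(seq)
            name, seq = line[1:], []
        else:
            cleaned = "".join(line.replace("*", "").split())
            if cleaned:
                seq.append(cleaned)
    if seq:
        yield name, "".join(seq)

if __name__ == "__main__":
    sample = [
        ";LCBO - Prolactin precursor - Bovine\n",
        "; a sample sequence in FASTA format\n",
        "MDSKGS SQKG\n",
        "YNNNC*\n",
        "\n",
        ">MCHU - Calmodulin\n",
        "ADQLTE\n",
        "MMTAK*\n",
    ]
    for name, seq in parse_fasta_lenient(sample):
        print("%s: %s" % (name, seq))
```

Collecting the cleaned chunks in a list and joining them once per record avoids repeated string concatenation, although each whole sequence is still held in memory until it is yielded.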

Ruby

<lang ruby>def fasta_format(strings)

 out, text = [], ""
 strings.split("\n").each do |line|
   if line[0] == '>'
     out << text unless text.empty?
     text = line[1..-1] + ": "
   else
     text << line
   end
 end
 out << text unless text.empty?

end

data = <<'EOS'
>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED
EOS

puts fasta_format(data)</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Run BASIC

<lang runbasic>a$ = ">Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED"

i = 1 while i <= len(a$)

 if mid$(a$,i,17) = ">Rosetta_Example_" then 
   print 
   print mid$(a$,i,18);": ";
   i = i + 17
  else
   if asc(mid$(a$,i,1)) > 20 then print mid$(a$,i,1);
 end if
 i = i + 1

wend</lang>

Output:

>Rosetta_Example_1: THERECANBENOSPACE
>Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Tcl

<lang tcl>proc fastaReader {filename} {

   set f [open $filename]
   set sep ""
   while {[gets $f line] >= 0} {

        if {[string match >* $line]} {
            puts -nonewline "$sep[string range $line 1 end]: "
            set sep "\n"
        } else {
            puts -nonewline $line
        }

   }
   puts ""
   close $f

}

fastaReader ./rosettacode.fas</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED