Splitting bioinformatics FASTA files

I keep forgetting where I put my scripts in my home directories. Below is my Ruby script to split a large FASTA [1] file into smaller files of N sequences each (N = 2 here):

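#!/usr/bin/env ruby
#
# Script: dumpseq.rb
# Description: Parses a BLAST FASTA file and dumps the sequences into
#              small files, 2 sequences per file.
# Usage: dumpseq.rb [fasta_file]

require 'fileutils'

abort "Usage: dumpseq.rb [fasta_file]" if ARGV.empty?

fasta_db = File.new(ARGV[0])

sno = 0 # sequences read so far
d = 0   # output files created so far

file = nil

until fasta_db.eof?
  # Read one record, up to and including the ">" that starts the next one,
  # then drop that trailing ">".
  x = fasta_db.readline("\n>").sub(/\n>\z/, "\n")
  if sno % 2 == 0 # 2 seqs per query
    file.close if file != nil
    # 1000 files per directory: D0000000, D0001000, ...
    dir = sprintf("D%04d000", d / 1000)
    FileUtils.mkdir_p dir
    # short filenames
    fname = sprintf "SEQ%07d.fasta", d
    d += 1
    file = File.new("#{dir}/#{fname}", "w")
  end
  file << x
  sno += 1
  # Put the ">" back so the next record keeps its header marker.
  fasta_db.ungetc(">") unless fasta_db.eof?
end

file.close if file != nil
fasta_db.close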

It's pretty hackish-looking. But then I found out that BioRuby [2] has wrappers for parsing FASTA files.
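
For comparison, here is a minimal sketch of the same split as a reusable method in plain Ruby, with the chunk size as a parameter instead of a hard-coded 2. It is not the BioRuby approach. The names split_fasta, per_file and out_dir are made up for this example, and it assumes every record starts with ">" at the beginning of a line.

```ruby
require 'fileutils'

# Hypothetical helper (not part of dumpseq.rb or BioRuby): split the FASTA
# text readable from `io` into files of `per_file` records each, written
# under `out_dir`. Returns the list of paths written.
def split_fasta(io, per_file: 2, out_dir: ".")
  paths = []
  out = nil
  count = 0 # records written so far
  # With "\n>" as the separator, each chunk is one record, possibly with a
  # leading ">" (first record) and a trailing ">" (start of the next one).
  io.each_line("\n>") do |chunk|
    body = chunk.sub(/\A>/, "").sub(/\n>\z/, "\n")
    next if body.strip.empty?
    if count % per_file == 0
      out&.close
      index = count / per_file
      # Same layout as the script above: 1000 files per directory.
      dir = File.join(out_dir, format("D%04d000", index / 1000))
      FileUtils.mkdir_p(dir)
      paths << File.join(dir, format("SEQ%07d.fasta", index))
      out = File.open(paths.last, "w")
    end
    out << ">" << body
    count += 1
  end
  paths
ensure
  out&.close
end
```

Called as split_fasta(File.open(ARGV[0]), per_file: 2), it should produce the same directory and file names as the script above. Iterating with each_line avoids the readline/ungetc dance, since the leading ">" is simply re-added to each record.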

[1] http://en.wikipedia.org/wiki/Fasta
[2] http://www.bioruby.org
