Posts Tagged 'bioinformatics'

FASTA splitting with BioRuby

In reference to my previous post, here’s the splitter using BioRuby.  Note that I also changed the outer loop to one file per iteration instead of some crazy rules of when to create the file.

#!/usr/bin/env ruby
# Script: dumpseq.rb [file] [N] [prefix]
# Description: Splits a fasta file evenl across N files.  dumps files in the
#              [prefix]  directory
require 'bio'
require 'fileutils'

include Bio

seqs =[0])
ncpus = ARGV[1].to_i
prefix = ARGV[2]

# Remove and hardwire n_seqs if you know beforehand the number of sequences in
# a file.  Saves readtime
n_seqs = 0
seqs.each do |seq|
 n_seqs += 1

overflow = n_seqs % ncpus
split_size = n_seqs / ncpus

ncpus.times do |i|
  filename = sprintf "%s/D%07d/seq%07d.fasta", prefix, i, i
  FileUtils.mkdir_p File.dirname(filename)
  dump =, "w")
  split_size.times do |j|
    dump << seqs.next_entry.to_s
  if i < overflow 
    dump << seqs.next_entry.to_s


Splitting bioinformatics FASTA files

I keep forgetting where my scripts were in my home directories. Below is my ruby script to split a large FASTA [1] sequence into N sequences per file:

#!/usr/bin/env ruby
# Script: dumpseq.rb
# Description: Parses the a BLAST Fasta file and dumps each sequence to a 
#              file.
# Usage: dumpseq.rb [fasta_file]

require 'fileutils'

fasta_db  =[0])

sno = 0
d = 0

file = nil

while true
  x = fasta_db.readline("\n>").sub(/>$/, "")
  x =~ />(.*)\n/
  if sno % 2 == 0 # 2 seqs per query
    file.close if file != nil
    dir = sprintf("D%04d000", d / 1000)
    FileUtils.mkdir_p dir
    # short filenames
    fname = sprintf "SEQ%07d.fasta", d
    d += 1
    file ="#{dir}/#{fname}","w")
  file << x
  sno += 1
  fasta_db.ungetc ?>

Its pretty hackish-looking. But then I found out that BioRuby [2] wrappers for parsing FASTA files.